Stefy Lanza (nextime / spora) authored
Some quantized fine-tunes (seen with an "Aggressive" Qwen3.6-35B Q4_K_M) collapse into a runaway repetition loop when top_p=1.0 and no repetition penalty is in effect: they emit a malformed parallel tool-call flood of 1700+ tokens that never terminates. These are exactly the conditions Qwen's own docs warn cause endless repetitions.

Two fixes:

1. Anti-loop generation stop in stream_chat_response. A model-agnostic detector normalises away the variable parts of the tail (quoted strings, filesystem paths, whitespace), so a loop whose only per-cycle difference is an arg/path still reads as periodic. It then breaks generation when a short structural unit repeats >=5x back-to-back. Tuned to not trip on prose, repetitive code, or a legit handful of distinct tool calls.

2. Honor client-supplied repetition controls. The chat paths previously forwarded only temperature/top_p and silently dropped the repeat/presence/frequency penalties, so a caller (e.g. Kilo) setting them per-model had no effect. Plumb them through generate_chat_stream / generate_chat to both backends (cuda already accepts them; vulkan now does too) with graceful signature fallbacks. Defaults are no-ops, so clients that leave them unset are unaffected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
a535c27f
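
The two fixes described in the message can be illustrated with a minimal Python sketch. It is not the code from this commit: the names `normalise`, `has_runaway_loop` and `call_with_supported_kwargs`, the regular expressions, and every threshold except the 5x repeat count are assumptions chosen for illustration. The sketch only handles double-quoted strings and re-normalises the whole text on each call, which a real streaming detector would likely avoid.

```python
import inspect
import re

# Hypothetical patterns: the commit names the categories (quoted strings,
# filesystem paths, whitespace) but not the expressions actually used.
_QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')
_PATH = re.compile(r'(?:[A-Za-z]:)?(?:[\\/][\w.\-]+)+')
_SPACE = re.compile(r'\s+')


def normalise(text):
    """Replace the variable parts of the output with fixed placeholders."""
    text = _QUOTED.sub('S', text)   # any double-quoted string becomes S
    text = _PATH.sub('P', text)     # any unquoted path becomes P
    return _SPACE.sub('', text)     # whitespace is dropped entirely


def has_runaway_loop(text, min_repeats=5, min_unit=8, max_unit=200,
                     tail_chars=2000):
    """Return True if the normalised tail ends in one unit repeated
    at least min_repeats times back-to-back."""
    tail = normalise(text)[-tail_chars:]
    for unit in range(min_unit, max_unit + 1):
        if unit * min_repeats > len(tail):
            break  # not enough text left for this (or any longer) unit
        candidate = tail[-unit:]
        if tail.endswith(candidate * min_repeats):
            return True
    return False


def call_with_supported_kwargs(fn, *args, **kwargs):
    """Call fn, dropping keyword arguments its signature does not accept.

    One possible form of a "graceful signature fallback": a backend that
    predates the penalty parameters is called without them instead of
    raising TypeError.
    """
    params = inspect.signature(fn).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return fn(*args, **kwargs)  # fn takes **kwargs, pass everything
    accepted = {k: v for k, v in kwargs.items() if k in params}
    return fn(*args, **accepted)
```

In a streaming loop, `has_runaway_loop` would be checked against the accumulated output every few tokens and generation stopped when it returns True. Because every tool call of the same shape normalises to the same unit, three or four legitimate calls stay under the 5x threshold while an endless flood does not.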