text: stop runaway tool-call loops + honor client repetition penalties
Some quantized fine-tunes (seen with an "Aggressive" Qwen3.6-35B Q4_K_M) collapse
into a runaway repetition loop: a malformed parallel tool-call flood of 1700+
tokens that never terminates. This happens when top_p=1.0 and no repetition
penalty is in effect, exactly the conditions Qwen's own docs warn can cause
endless repetitions.
Two fixes:
1. Anti-loop generation stop in stream_chat_response: a model-agnostic detector
normalises away the variable parts of the tail (quoted strings, filesystem
paths, whitespace) so a loop whose only per-cycle difference is an arg/path
still reads as periodic, then breaks generation when a short structural unit
repeats >=5x back-to-back. Tuned to not trip on prose, repetitive code, or a
legit handful of distinct tool calls.
2. Honor client-supplied repetition controls. The chat paths previously forwarded
only temperature/top_p, silently dropping repeat/presence/frequency penalty —
so a caller (e.g. Kilo) setting them per-model had no effect. Plumb them through
generate_chat_stream / generate_chat to both backends (cuda already accepts
them; vulkan now does too) with graceful signature fallbacks. Defaults are
no-ops, so unset clients are unaffected.
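
A minimal sketch of the idea behind fix 1, not the code in this commit. The
helper names (normalise_tail, looks_like_loop), the regexes, the placeholder
letters, and every threshold except the 5-repeat count are assumptions made
for illustration:

```python
import re

# Hypothetical patterns; the real detector may normalise differently.
_QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')               # double-quoted strings
_PATH = re.compile(r'(?:[A-Za-z]:)?(?:[\\/][\w.\-]+)+')  # unix/windows-like paths
_SPACE = re.compile(r'\s+')


def normalise_tail(text: str, tail_chars: int = 2000) -> str:
    """Replace the variable parts of the tail with fixed placeholders."""
    tail = text[-tail_chars:]
    tail = _QUOTED.sub('S', tail)   # every quoted string reads the same
    tail = _PATH.sub('P', tail)     # every bare path reads the same
    return _SPACE.sub('', tail)     # whitespace differences are ignored


def looks_like_loop(text: str, min_unit: int = 8, max_unit: int = 200,
                    min_repeats: int = 5) -> bool:
    """True if the normalised tail ends in one unit repeated back-to-back."""
    tail = normalise_tail(text)
    for unit_len in range(min_unit, max_unit + 1):
        need = unit_len * min_repeats
        if len(tail) < need:
            break  # longer units cannot fit min_repeats times either
        unit = tail[-unit_len:]
        if tail[-need:] == unit * min_repeats:
            return True
    return False
```

In a streaming loop, a check like this would run on the accumulated output
every few tokens, and generation would stop once it returns True. The minimum
unit length and the repeat count are what keep ordinary prose and a handful of
similar-looking tool calls from tripping it.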
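
One possible shape for the "graceful signature fallback" in fix 2, sketched
under assumptions: the helper names and the NOOP_PENALTIES table are
hypothetical, and the commit may use a different mechanism (for example,
catching TypeError). The no-op values assume a multiplicative repeat penalty
(1.0) and additive presence/frequency penalties (0.0):

```python
import inspect

# Assumed no-op values: these leave sampling unchanged when a client sets nothing.
NOOP_PENALTIES = {
    "repeat_penalty": 1.0,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
}


def penalty_kwargs(request: dict) -> dict:
    """Pick client-supplied penalties, falling back to no-op defaults."""
    out = {}
    for name, default in NOOP_PENALTIES.items():
        value = request.get(name)
        out[name] = default if value is None else value
    return out


def call_with_supported_kwargs(fn, *args, **kwargs):
    """Call fn, dropping keyword arguments its signature does not accept."""
    params = inspect.signature(fn).parameters
    takes_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if not takes_var_kw:
        kwargs = {k: v for k, v in kwargs.items() if k in params}
    return fn(*args, **kwargs)
```

With this approach, a backend whose generate function has not yet grown the
penalty parameters keeps working, while one that accepts them receives the
client's values.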
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>