Stefy Lanza (nextime / spora) authored
Per-model ds4 tuning (these vary by quant/size/context, so they belong on
the model, not globally):
- Optional `ds4` block on a model entry overrides the global Ds4Config for
  ssd_streaming / expert_cache_reserve_gb / extra_args / extra_env; unset
  fields inherit the global config (the default/template). Ds4Backend looks
  up its own model entry and applies the overrides via dataclasses.replace.
- admin: api_model_configure accepts + normalizes the per-model `ds4` block,
  dropping it when empty.
- models page: a "ds4 streaming" section shown only when ds4 is enabled
  globally and the model is a deepseek4; n_ctx stays the context knob.

Fix garbled / truncated ds4 replies: the streaming reader used
iter_lines(decode_unicode=True), which decodes each network chunk
independently and corrupts a multibyte UTF-8 char split across chunks
('—' -> 'â'); the broken JSON then made json.loads fail and the token was
silently dropped (truncated tails). Parse the SSE byte stream and split on
the b"\n" byte (never inside a UTF-8 sequence), decoding whole lines; also
flush a final newline-less line.

UI: slow-reply notice reworded to "Waiting for model reply..." with a
trailing newline so the real reply starts on its own line.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
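
The per-model override described above can be sketched roughly as follows. This is an illustration only, not the code from this commit: the four field names come from the message, while their types and defaults, the helper name `apply_model_overrides`, and the dict-shaped model entry are assumptions.

```python
from dataclasses import dataclass, field, fields, replace
from typing import Any, Dict, List


@dataclass(frozen=True)
class Ds4Config:
    # Field names are taken from the commit message; types/defaults are assumed.
    ssd_streaming: bool = False
    expert_cache_reserve_gb: float = 0.0
    extra_args: List[str] = field(default_factory=list)
    extra_env: Dict[str, str] = field(default_factory=dict)


def apply_model_overrides(global_cfg: Ds4Config, model_entry: Dict[str, Any]) -> Ds4Config:
    """Hypothetical helper: overlay a model entry's optional `ds4` block."""
    block = model_entry.get("ds4") or {}
    known = {f.name for f in fields(Ds4Config)}
    # Keep only known fields that are actually set; None means "inherit".
    overrides = {k: v for k, v in block.items() if k in known and v is not None}
    # dataclasses.replace returns a copy, so the global config is never mutated.
    return replace(global_cfg, **overrides)
```

Because `replace` builds a new instance, the global config keeps acting as the shared default/template while each model gets its own effective copy.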
6a111627
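
A minimal sketch of the byte-level SSE parsing described in the fix, for illustration only. The function names, the `data:` payload handling, and the `[DONE]` sentinel are assumptions rather than the actual reader in this commit; with `requests`, the chunks would typically come from `response.iter_content(chunk_size=None)` on a streamed response.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_sse_lines(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield decoded lines from raw byte chunks, splitting only on b"\\n"."""
    buf = b""
    for chunk in chunks:
        buf += chunk
        while True:
            nl = buf.find(b"\n")
            if nl < 0:
                break  # incomplete line: wait for more bytes
            line, buf = buf[:nl], buf[nl + 1:]
            # 0x0A never occurs inside a multibyte UTF-8 sequence, so a
            # complete line always decodes as whole characters.
            yield line.rstrip(b"\r").decode("utf-8", errors="replace")
    if buf:
        # Flush a final line that arrived without a trailing newline.
        yield buf.rstrip(b"\r").decode("utf-8", errors="replace")


def iter_events(chunks: Iterable[bytes]) -> Iterator[Dict[str, Any]]:
    """Parse `data:` lines as JSON objects (payload shape is assumed)."""
    for line in iter_sse_lines(chunks):
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":  # assumed end-of-stream marker
            continue
        yield json.loads(payload)
```

Buffering bytes and decoding per line means a chunk boundary that lands inside a character such as '—' (three bytes in UTF-8) no longer produces invalid JSON.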