-
Stefy Lanza (nextime / spora ) authored
Auto-compaction never triggered: multi_model_manager.config stores the whitelisted build_runtime_kwargs() dict, which drops the per-model auto_compact* keys (they survive only under _raw_cfg), so _resolve_compaction always read the global default (False) and returned None. Read the keys via a _raw_cfg fallback so per-model compaction config is honoured. Also rework the over-context handling to count the reply reservation, since the reply is generated into the same window (prompt + max_tokens <= n_ctx). Four layers, cheapest first: 1. fits as-is -> nothing 2. overflow within tol -> trim max_tokens to fit (lossless) 3. beyond tol & big prompt -> compact history (drop/summarize) 4. single message too big -> slice it (summarize its middle, keep head/tail) The chars/4 estimate undercounts token-dense code/JSON, so trimming to the exact n_ctx edge could still overflow; inflate the estimate by a configurable estimate_safety (default 1.15) for all physical-fit decisions. New CompactionConfig knobs (per-model overridable): tolerance_pct (20), min_output (512), estimate_safety (1.15). Effective max_tokens is threaded back to both the streaming and non-streaming generation paths. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
913e283a