    text: make auto-compaction actually fire — fix config lookup + max_tokens-aware layered trimming · 913e283a
    Stefy Lanza (nextime / spora ) authored
    Auto-compaction never triggered: multi_model_manager.config stores the
    whitelisted build_runtime_kwargs() dict, which drops the per-model
    auto_compact* keys (they survive only under _raw_cfg), so _resolve_compaction
    always read the global default (False) and returned None. Read the keys via a
    _raw_cfg fallback so per-model compaction config is honoured.
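
A minimal sketch of the fallback lookup described above. The helper name `lookup_compaction_key` and the assumption that `_raw_cfg` is a nested dict inside the stored config are illustrative, not taken from text.py:

```python
from typing import Any

_MISSING = object()  # sentinel, so an explicit False/None in the config is still honoured


def lookup_compaction_key(model_cfg: dict[str, Any], key: str, default: Any = None) -> Any:
    """Read `key` from the whitelisted kwargs, then from `_raw_cfg`, then fall back."""
    value = model_cfg.get(key, _MISSING)
    if value is _MISSING:
        # Assumption: the unfiltered per-model config is kept under "_raw_cfg".
        raw = model_cfg.get("_raw_cfg") or {}
        value = raw.get(key, _MISSING)
    return default if value is _MISSING else value


# The whitelisted dict has lost auto_compact; it survives only under _raw_cfg.
cfg = {"n_ctx": 8192, "_raw_cfg": {"n_ctx": 8192, "auto_compact": True}}
enabled = lookup_compaction_key(cfg, "auto_compact", default=False)
```

Using a sentinel rather than `dict.get(key) or ...` keeps a per-model `auto_compact: False` from being silently replaced by a truthy global default.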
    
    Also rework the over-context handling to count the reply reservation, since the
    reply is generated into the same window (prompt + max_tokens <= n_ctx). Four
    layers, cheapest first:
      1. fits as-is              -> nothing
      2. overflow within tol     -> trim max_tokens to fit (lossless)
      3. beyond tol & big prompt -> compact history (drop/summarize)
      4. single message too big  -> slice it (summarize its middle, keep head/tail)
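
The four layers can be sketched as a single decision function. This is an illustration under stated assumptions, not the code in text.py: tolerance is taken as a percentage of n_ctx, min_output as the floor for a trimmed max_tokens, and the names `Decision` and `decide` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    action: str      # "none" | "trim_max_tokens" | "compact_history" | "slice_message"
    max_tokens: int  # effective reply budget to pass to generation


def decide(prompt_tokens: int, max_tokens: int, n_ctx: int,
           largest_message_tokens: int,
           tolerance_pct: float = 20.0, min_output: int = 512) -> Decision:
    overflow = prompt_tokens + max_tokens - n_ctx
    if overflow <= 0:
        return Decision("none", max_tokens)              # layer 1: fits as-is
    room = n_ctx - prompt_tokens                         # space left for the reply
    # Assumption: "within tolerance" means overflow <= tolerance_pct % of n_ctx.
    if overflow <= n_ctx * tolerance_pct / 100 and room >= min_output:
        return Decision("trim_max_tokens", room)         # layer 2: lossless trim
    # Assumption: one message is "too big" if it cannot fit next to min_output.
    if largest_message_tokens + min_output > n_ctx:
        return Decision("slice_message", max_tokens)     # layer 4
    return Decision("compact_history", max_tokens)       # layer 3


trimmed = decide(7000, 2000, 8192, largest_message_tokens=500)
```

After a compaction or slicing step the caller would re-estimate the prompt and call `decide` again, so the cheaper layers get another chance on the shortened prompt.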
    
    The chars/4 estimate undercounts token-dense code/JSON, so trimming to the exact
    n_ctx edge could still overflow; inflate the estimate by a configurable
    estimate_safety (default 1.15) for all physical-fit decisions.
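
A rough sketch of the inflated estimate, assuming the safety factor is simply multiplied onto the chars/4 count and rounded up. The function names `estimate_tokens` and `fits` are illustrative:

```python
import math


def estimate_tokens(text: str, estimate_safety: float = 1.15) -> int:
    """chars/4 heuristic, inflated by a safety factor and rounded up."""
    return math.ceil(len(text) / 4 * estimate_safety)


def fits(prompt: str, max_tokens: int, n_ctx: int, estimate_safety: float = 1.15) -> bool:
    """Physical-fit check: the prompt and the reply reservation share one window."""
    return estimate_tokens(prompt, estimate_safety) + max_tokens <= n_ctx


prompt = "x" * 4000  # about 1000 tokens by the plain chars/4 estimate
plain = estimate_tokens(prompt, estimate_safety=1.0)
padded = estimate_tokens(prompt)
```

Rounding up errs on the side of overestimating, which matches the goal of never trimming right up to the n_ctx edge.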
    
    New CompactionConfig knobs (per-model overridable): tolerance_pct (20),
    min_output (512), estimate_safety (1.15). Effective max_tokens is threaded back
    to both the streaming and non-streaming generation paths.
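
For illustration, the three knobs and a per-model override could look like the sketch below. `CompactionKnobs` is a hypothetical stand-in for the real CompactionConfig, whose other fields are not shown in this message. Only the field names and defaults come from the text above.

```python
from dataclasses import dataclass, fields, replace
from typing import Any


@dataclass(frozen=True)
class CompactionKnobs:  # hypothetical stand-in, not the actual CompactionConfig
    tolerance_pct: float = 20.0     # allowed overflow before compaction is considered
    min_output: int = 512           # smallest reply budget a trim may leave
    estimate_safety: float = 1.15   # multiplier on the chars/4 token estimate


def with_overrides(base: CompactionKnobs, per_model: dict[str, Any]) -> CompactionKnobs:
    """Apply per-model values for known knobs; ignore unrelated config keys."""
    known = {f.name for f in fields(base)}
    return replace(base, **{k: v for k, v in per_model.items() if k in known})


model_knobs = with_overrides(CompactionKnobs(), {"min_output": 256, "n_ctx": 4096})
```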
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
text.py 143 KB