text: make auto-compaction actually fire — fix config lookup + max_tokens-aware layered trimming
Auto-compaction never triggered: multi_model_manager.config stores the
whitelisted build_runtime_kwargs() dict, which drops the per-model
auto_compact* keys (they survive only under _raw_cfg), so _resolve_compaction
always read the global default (False) and returned None. Read the keys via a
_raw_cfg fallback so per-model compaction config is honoured.
Also rework the over-context handling to count the reply reservation, since the
reply is generated into the same window (prompt + max_tokens <= n_ctx). Four
layers, cheapest first:
1. fits as-is -> nothing
2. overflow within tol -> trim max_tokens to fit (lossless)
3. beyond tol & big prompt -> compact history (drop/summarize)
4. single message too big -> slice it (summarize its middle, keep head/tail)
The chars/4 estimate undercounts token-dense code/JSON, so trimming to the exact
n_ctx edge could still overflow; inflate the estimate by a configurable
estimate_safety (default 1.15) for all physical-fit decisions.
New CompactionConfig knobs (per-model overridable): tolerance_pct (20),
min_output (512), estimate_safety (1.15). Effective max_tokens is threaded back
to both the streaming and non-streaming generation paths.
Co-Authored-By:
Claude Opus 4.8 <noreply@anthropic.com>
Showing
This diff is collapsed.
Please
register
or
sign in
to comment