codai/api/text.py · 913e283aeea99c2894448b6cebaf59e7373bd42d · nexlab / coderai

text: make auto-compaction actually fire (fix config lookup + max_tokens-aware layered trimming)

Stefy Lanza (nextime / spora ) authored Jun 20, 2026

Auto-compaction never triggered: multi_model_manager.config stores the
whitelisted build_runtime_kwargs() dict, which drops the per-model
auto_compact* keys (they survive only under _raw_cfg), so _resolve_compaction
always read the global default (False) and returned None. Read the keys via a
_raw_cfg fallback so per-model compaction config is honoured.
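A minimal sketch of the fallback lookup described above. The helper name `lookup_model_key` and the dict layout (a whitelisted runtime dict that carries the unfiltered copy under a `_raw_cfg` key) are assumptions for illustration, not the actual code in text.py:

```python
from typing import Any, Mapping

_MISSING = object()  # sentinel so that an explicit False/None value is still honoured


def lookup_model_key(model_cfg: Mapping[str, Any], key: str, default: Any = None) -> Any:
    """Read a per-model key, falling back to the unfiltered _raw_cfg copy."""
    value = model_cfg.get(key, _MISSING)
    if value is not _MISSING:
        return value
    # Hypothetical layout: the pre-whitelist config is kept under "_raw_cfg".
    raw = model_cfg.get("_raw_cfg") or {}
    return raw.get(key, default)


# The whitelisted dict dropped auto_compact, but the raw copy still has it.
model_cfg = {"n_ctx": 8192, "_raw_cfg": {"n_ctx": 8192, "auto_compact": True}}
global_default = False
enabled = lookup_model_key(model_cfg, "auto_compact", global_default)
```

Without the fallback, the lookup would see only the whitelisted keys and return the global default, which is the failure mode described above.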

Also rework the over-context handling to count the reply reservation, since the
reply is generated into the same window (prompt + max_tokens <= n_ctx). Four
layers, cheapest first:
  1. fits as-is              -> nothing
  2. overflow within tol     -> trim max_tokens to fit (lossless)
  3. beyond tol & big prompt -> compact history (drop/summarize)
  4. single message too big  -> slice it (summarize its middle, keep head/tail)
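The four layers can be sketched as a single decision function. This is an illustration under assumptions: the name `plan_fit`, the action strings, and the exact thresholds (tolerance measured against n_ctx, a message counted as "too big" when it alone exceeds n_ctx minus min_output) are hypothetical and may differ from the real logic:

```python
from typing import Tuple


def plan_fit(prompt_tokens: int, max_tokens: int, n_ctx: int,
             largest_message_tokens: int,
             tolerance_pct: float = 20.0, min_output: int = 512) -> Tuple[str, int]:
    """Pick the cheapest layer that makes prompt + max_tokens fit in n_ctx.

    Returns (action, max_tokens to use for this attempt).
    """
    room = n_ctx - prompt_tokens                    # space left for the reply
    overflow = prompt_tokens + max_tokens - n_ctx   # > 0 means the window is exceeded

    # Layer 1: everything already fits.
    if overflow <= 0:
        return "fits", max_tokens

    # Layer 2: small overflow, and the reply still gets at least min_output.
    if overflow <= n_ctx * tolerance_pct / 100 and room >= min_output:
        return "trim_max_tokens", room

    # Layer 4 check first: compaction cannot help if one message alone is too big.
    if largest_message_tokens > n_ctx - min_output:
        return "slice_message", max_tokens

    # Layer 3: drop or summarize older history, then plan again.
    return "compact_history", max_tokens
```

For the last two actions the caller would shrink the prompt and call `plan_fit` again, so the returned max_tokens is only the unchanged request.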

The chars/4 estimate undercounts token-dense code/JSON, so trimming to the exact
n_ctx edge could still overflow; inflate the estimate by a configurable
estimate_safety (default 1.15) for all physical-fit decisions.
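A rough sketch of the inflated estimate, assuming the estimator is simply character count divided by four; the function names here are hypothetical:

```python
import math


def estimate_tokens(text: str, estimate_safety: float = 1.15) -> int:
    """chars/4 heuristic, inflated by a safety factor and rounded up."""
    return math.ceil(len(text) / 4 * estimate_safety)


def fits_window(prompt: str, max_tokens: int, n_ctx: int,
                estimate_safety: float = 1.15) -> bool:
    """Physical-fit check: inflated prompt estimate plus reply reservation."""
    return estimate_tokens(prompt, estimate_safety) + max_tokens <= n_ctx


dense_json = '{"k":[1,2,3]}' * 300  # token-dense input that chars/4 tends to undercount
raw = estimate_tokens(dense_json, estimate_safety=1.0)
padded = estimate_tokens(dense_json)  # default safety factor leaves headroom
```

The padding makes a prompt that sits right at the n_ctx edge under the raw estimate count as not fitting, so it is trimmed or compacted before the real tokenizer can overflow.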

New CompactionConfig knobs (per-model overridable): tolerance_pct (20),
min_output (512), estimate_safety (1.15). Effective max_tokens is threaded back
to both the streaming and non-streaming generation paths.
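The three new knobs and their per-model override could look roughly like this. `CompactionKnobs` and `with_overrides` are hypothetical stand-ins covering only the fields named above, not the real CompactionConfig class:

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Mapping


@dataclass(frozen=True)
class CompactionKnobs:
    """Stand-in holding only the three knobs introduced by this change."""
    tolerance_pct: float = 20.0    # overflow (percent) still handled by trimming max_tokens
    min_output: int = 512          # smallest reply reservation worth keeping
    estimate_safety: float = 1.15  # multiplier applied to the chars/4 estimate


def with_overrides(base: CompactionKnobs, overrides: Mapping[str, Any]) -> CompactionKnobs:
    """Apply per-model values on top of the defaults, ignoring unrelated keys."""
    known = {f.name for f in fields(base)}
    return replace(base, **{k: v for k, v in overrides.items() if k in known})


per_model = {"min_output": 256, "n_ctx": 8192}  # n_ctx is not a knob and is skipped
knobs = with_overrides(CompactionKnobs(), per_model)
```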

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
