text: make auto-compaction actually fire — fix config lookup + max_tokens-aware layered trimming

Auto-compaction never triggered: multi_model_manager.config stores the
whitelisted build_runtime_kwargs() dict, which drops the per-model
auto_compact* keys (they survive only under _raw_cfg), so _resolve_compaction
always read the global default (False) and returned None. Read the keys via a
_raw_cfg fallback so per-model compaction config is honoured.
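
A minimal sketch of the fallback lookup described above. It assumes the stored per-model dict nests the original models.json entry under a `_raw_cfg` key; `lookup_model_key` is a hypothetical helper name, not the function the commit actually adds:

```python
from typing import Any, Mapping

_MISSING = object()  # sentinel, so an explicit False/0 is not mistaken for "unset"


def lookup_model_key(model_cfg: Mapping[str, Any], key: str, default: Any) -> Any:
    """Return a per-model setting: top level first, then _raw_cfg, then default."""
    value = model_cfg.get(key, _MISSING)
    if value is _MISSING:
        # Assumption: keys dropped by the whitelist are still present here.
        raw = model_cfg.get("_raw_cfg") or {}
        value = raw.get(key, _MISSING)
    return default if value is _MISSING else value


# The whitelisted runtime dict has no auto_compact key, but _raw_cfg still does.
stored = {"n_ctx": 8192, "_raw_cfg": {"n_ctx": 8192, "auto_compact": True}}
print(lookup_model_key(stored, "auto_compact", False))
```

The sentinel matters because a model may set `auto_compact` to `False` explicitly, and that value should win over a global default of `True`.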

Also rework the over-context handling to count the reply reservation, since the
reply is generated into the same window (prompt + max_tokens <= n_ctx). Four
layers, cheapest first:
  1. fits as-is              -> nothing
  2. overflow within tol     -> trim max_tokens to fit (lossless)
  3. beyond tol & big prompt -> compact history (drop/summarize)
  4. single message too big  -> slice it (summarize its middle, keep head/tail)
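
The four layers can be sketched as a single decision function. This is an illustration only: `plan_fit` is a hypothetical name, "big prompt" is assumed to mean the prompt has reached `pct` percent of `n_ctx`, and the slicing target is assumed to be `n_ctx - min_output`. The real code may compact first and slice afterwards rather than choosing one of the two:

```python
from typing import Tuple


def plan_fit(
    prompt_tokens: int,
    max_tokens: int,
    n_ctx: int,
    largest_message_tokens: int,
    *,
    pct: int = 85,
    tolerance_pct: int = 20,
    min_output: int = 512,
) -> Tuple[str, int]:
    """Return (action, effective_max_tokens) for one request."""
    total = prompt_tokens + max_tokens
    if total <= n_ctx:
        return "none", max_tokens  # layer 1: fits as-is

    room = n_ctx - prompt_tokens  # tokens left for the reply
    within_tol = total <= n_ctx * (100 + tolerance_pct) / 100
    big_prompt = prompt_tokens >= n_ctx * pct / 100  # assumed meaning of "big"
    if room >= min_output and (within_tol or not big_prompt):
        return "trim", room  # layer 2: shrink the reply budget, keep history

    target = n_ctx - min_output  # assumed prompt budget after compaction
    if largest_message_tokens > target:
        return "slice", max_tokens  # layer 4: one message alone is too large
    return "compact", max_tokens  # layer 3: drop or summarize old turns


print(plan_fit(7000, 2000, 8192, 500))
```

For `compact` and `slice` the sketch returns `max_tokens` unchanged, on the assumption that the caller re-runs the check once the prompt has been shortened.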

The chars/4 estimate undercounts token-dense code/JSON, so trimming to the exact
n_ctx edge could still overflow; inflate the estimate by a configurable
estimate_safety (default 1.15) for all physical-fit decisions.
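
A rough sketch of the inflated estimate, assuming it is simply the chars/4 count multiplied by `estimate_safety` and rounded up; `estimate_tokens` is a hypothetical name and the rounding choice is an assumption:

```python
import math


def estimate_tokens(text: str, estimate_safety: float = 1.15) -> int:
    """Estimate the token count as chars/4, inflated by a safety factor."""
    return math.ceil(len(text) / 4 * estimate_safety)


def fits(prompt: str, max_tokens: int, n_ctx: int, estimate_safety: float = 1.15) -> bool:
    """Physical-fit check: inflated prompt estimate plus reply budget."""
    return estimate_tokens(prompt, estimate_safety) + max_tokens <= n_ctx


prompt = "x" * 28000  # 7000 tokens by the plain chars/4 rule
print(fits(prompt, 1000, 8192, estimate_safety=1.0))  # plain estimate
print(fits(prompt, 1000, 8192))                       # inflated estimate
```

With the plain estimate the request appears to fit; with the inflated one it does not, so the reply budget would be trimmed before the backend can overflow.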

New CompactionConfig knobs (per-model overridable): tolerance_pct (20),
min_output (512), estimate_safety (1.15). Effective max_tokens is threaded back
to both the streaming and non-streaming generation paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 34d666d6
@@ -222,17 +222,28 @@ class CompactionConfig:
     """Global defaults for auto-compaction of an over-long chat history.

     Per-model settings in a models.json entry (``auto_compact``,
-    ``auto_compact_pct``, ``auto_compact_strategy``, ``auto_compact_model``)
-    OVERRIDE the values here; when a model leaves one unset, the global default
-    below applies. ``model`` selects which model performs the summarization for
-    the ``summarize`` strategy — empty means use the same model that serves the
-    request. Pointing it at a smaller/faster model lets that model summarize the
-    old turns while the big model answers; the dropped history is chunked to fit
-    the chosen summarizer's own context window before it is summarized."""
+    ``auto_compact_pct``, ``auto_compact_strategy``, ``auto_compact_model``,
+    ``auto_compact_tolerance_pct``) OVERRIDE the values here; when a model leaves
+    one unset, the global default below applies. ``model`` selects which model
+    performs the summarization for the ``summarize`` strategy — empty means use
+    the same model that serves the request. Pointing it at a smaller/faster model
+    lets that model summarize the old turns while the big model answers; the
+    dropped history is chunked to fit the chosen summarizer's own context window
+    before it is summarized.
+
+    The over-context check counts the prompt PLUS the request's ``max_tokens``
+    (the reply is generated into the same window). When that total overflows
+    ``n_ctx`` by no more than ``tolerance_pct`` we leave history alone and just
+    trim ``max_tokens`` to fit (lossless); beyond that we compact, and as a last
+    resort slice any single message still larger than the target. ``min_output``
+    is the smallest reply (in tokens) we always try to leave room for."""
     enabled: bool = False
     pct: int = 85  # compact when the prompt reaches this % of n_ctx
     strategy: str = "drop_oldest"  # drop_oldest | keep_head_tail | summarize
     model: str = ""  # model id/alias that summarizes; "" = same as request
+    tolerance_pct: int = 20  # accept prompt+max_tokens up to this % over n_ctx (trim instead of compact)
+    min_output: int = 512  # always try to leave at least this many tokens for the reply
+    estimate_safety: float = 1.15  # inflate the chars/4 prompt estimate by this factor for fit/trim (it undercounts code/JSON)


 @dataclass
@@ -666,6 +677,9 @@ class ConfigManager:
     "pct": self.config.compaction.pct,
     "strategy": self.config.compaction.strategy,
     "model": self.config.compaction.model,
+    "tolerance_pct": self.config.compaction.tolerance_pct,
+    "min_output": self.config.compaction.min_output,
+    "estimate_safety": self.config.compaction.estimate_safety,
 },
 "broker": {
     "enabled": self.config.broker.enabled,
......