text: make auto-compaction actually fire — fix config lookup + max_tokens-aware layered trimming

Auto-compaction never triggered: multi_model_manager.config stores the whitelisted build_runtime_kwargs() dict, which drops the per-model auto_compact* keys (they survive only under _raw_cfg), so _resolve_compaction always read the global default (False) and returned None. Read the keys via a _raw_cfg fallback so per-model compaction config is honoured.

Also rework the over-context handling to count the reply reservation, since the reply is generated into the same window (prompt + max_tokens <= n_ctx). Four layers, cheapest first:

  1. fits as-is -> nothing
  2. overflow within tol -> trim max_tokens to fit (lossless)
  3. beyond tol & big prompt -> compact history (drop/summarize)
  4. single message too big -> slice it (summarize its middle, keep head/tail)

The chars/4 estimate undercounts token-dense code/JSON, so trimming to the exact n_ctx edge could still overflow; inflate the estimate by a configurable estimate_safety (default 1.15) for all physical-fit decisions.

New CompactionConfig knobs (per-model overridable): tolerance_pct (20), min_output (512), estimate_safety (1.15). Effective max_tokens is threaded back to both the streaming and non-streaming generation paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
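
The four layers described above can be sketched roughly as follows. This is an illustrative sketch only, not the code from codai/api/text.py (that diff is not shown on this page). The names Knobs, estimate_prompt_tokens and plan are hypothetical, as are the choice of n_ctx - min_output as the slicing target and the rule that a trim leaving less than min_output falls through to compaction.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Knobs:
    """Hypothetical stand-in for the three new CompactionConfig fields."""
    tolerance_pct: int = 20
    min_output: int = 512
    estimate_safety: float = 1.15


def estimate_prompt_tokens(text: str, safety: float) -> int:
    """chars/4 heuristic, inflated because it undercounts code/JSON."""
    return math.ceil(len(text) / 4 * safety)


def plan(prompt_tokens: int, max_tokens: int, n_ctx: int,
         largest_message_tokens: int, knobs: Knobs) -> Tuple[str, Optional[int]]:
    """Pick the cheapest layer; returns (action, effective max_tokens).

    prompt_tokens is expected to be the already-inflated estimate.
    For "compact" and "slice" the reply budget is left as None because it
    would be recomputed once the history has actually been shortened.
    """
    total = prompt_tokens + max_tokens
    if total <= n_ctx:
        return "fit", max_tokens                      # layer 1: nothing to do

    limit = n_ctx * (100 + knobs.tolerance_pct) // 100
    room = n_ctx - prompt_tokens
    # Assumption: trimming only counts as acceptable if min_output still fits.
    if total <= limit and room >= knobs.min_output:
        return "trim", room                           # layer 2: lossless

    # Assumption: the prompt target leaves exactly min_output for the reply.
    target = n_ctx - knobs.min_output
    if largest_message_tokens > target:
        return "slice", None                          # layer 4: one huge message
    return "compact", None                            # layer 3: drop/summarize
```

With the defaults and n_ctx = 4096, a 3000-token prompt asking for 1500 reply tokens overflows by about 10% and is trimmed to 1096 reply tokens, while the same prompt asking for 3000 reply tokens exceeds the 20% tolerance and goes to compaction.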

Commit 913e283a authored Jun 20, 2026 by Stefy Lanza (nextime / spora )

Showing with 242 additions and 61 deletions

text.py codai/api/text.py +221 -54

config.py codai/config.py +21 -7

--- a/codai/api/text.py
+++ b/codai/api/text.py
--- a/codai/config.py
+++ b/codai/config.py
@@ -222,17 +222,28 @@ class CompactionConfig:
    """Global defaults for auto-compaction of an over-long chat history.
    Per-model settings in a models.json entry (``auto_compact``,
-    ``auto_compact_pct``, ``auto_compact_strategy``, ``auto_compact_model``)
+    ``auto_compact_pct``, ``auto_compact_strategy``, ``auto_compact_model``,
-    OVERRIDE the values here; when a model leaves one unset, the global default
+    ``auto_compact_tolerance_pct``) OVERRIDE the values here; when a model leaves
-    below applies. ``model`` selects which model performs the summarization for
+    one unset, the global default below applies. ``model`` selects which model
-    the ``summarize`` strategy — empty means use the same model that serves the
+    performs the summarization for the ``summarize`` strategy — empty means use
-    request. Pointing it at a smaller/faster model lets that model summarize the
+    the same model that serves the request. Pointing it at a smaller/faster model
-    old turns while the big model answers; the dropped history is chunked to fit
+    lets that model summarize the old turns while the big model answers; the
-    the chosen summarizer's own context window before it is summarized."""
+    dropped history is chunked to fit the chosen summarizer's own context window
+    before it is summarized.
+    The over-context check counts the prompt PLUS the request's ``max_tokens``
+    (the reply is generated into the same window). When that total overflows
+    ``n_ctx`` by no more than ``tolerance_pct`` we leave history alone and just
+    trim ``max_tokens`` to fit (lossless); beyond that we compact, and as a last
+    resort slice any single message still larger than the target. ``min_output``
+    is the smallest reply (in tokens) we always try to leave room for."""
    enabled: bool = False
    pct: int = 85                        # compact when the prompt reaches this % of n_ctx
    strategy: str = "drop_oldest"        # drop_oldest | keep_head_tail | summarize
    model: str = ""                      # model id/alias that summarizes; "" = same as request
+    tolerance_pct: int = 20              # accept prompt+max_tokens up to this % over n_ctx (trim instead of compact)
+    min_output: int = 512                # always try to leave at least this many tokens for the reply
+    estimate_safety: float = 1.15        # inflate the chars/4 prompt estimate by this factor for fit/trim (it undercounts code/JSON)
 @dataclass
@@ -666,6 +677,9 @@ class ConfigManager:
                "pct": self.config.compaction.pct,
                "strategy": self.config.compaction.strategy,
                "model": self.config.compaction.model,
+                "tolerance_pct": self.config.compaction.tolerance_pct,
+                "min_output": self.config.compaction.min_output,
+                "estimate_safety": self.config.compaction.estimate_safety,
            },
            "broker": {
                "enabled": self.config.broker.enabled,