feat: smart context caching, VRAM offload fix, GPTQ/AWQ quant backend
Smart context caching (both text backends):
- Per-instance generation lock so pooled concurrent requests can't corrupt a
shared KV cache (GGUF + HF, incl. streaming worker thread).
- GGUF: enable multi-slot LlamaRAMCache, budget via kv_cache_budget_mb (512MB).
- HF: replace single exact-text KV slot with an LRU of token-prefix slots +
token-level longest-common-prefix + DynamicCache clone/crop (handles
mid-history edits); kv_cache_slots (default 3).
- Session-affinity routing in ModelInstancePool.acquire(session_key); key from
user/X-Session-Id else a stable prefix hash.
- RAM-pressure ladder drops reclaimable prefix caches before evicting models.
VRAM fix:
- Auto-fit check no longer double-counts the KV/activation reserve when
expected_vram_gb is already a peak estimate — borderline models (e.g.
gemma-4-26B-A4B) stay GPU-resident instead of forced into MoE-thrashing
device_map offload.
GPTQ/AWQ fast-kernel quant backend (HF path):
- New codai/models/quant.py: GPTQModel capability detection, quantized-checkpoint
cache, on-demand background quantize job (falls back to bnb if unsupported).
- quant_backend config (auto|bnb|gptq|awq); loader auto-uses a quantized
checkpoint with Marlin/ExLlama when present, else bitsandbytes.
- Admin endpoints + "Quantize to 4-bit" button with live status on the model page.
- requirements-nvidia.txt documents the from-source install + numpy caveat.
Co-Authored-By:
Claude Opus 4.8 <noreply@anthropic.com>
Showing
This diff is collapsed.
This diff is collapsed.
codai/models/quant.py
0 → 100644
Please
register
or
sign in
to comment