codai/models/manager.py · 6d053dc1be04d84134ddfb8b00f1a247cad4b9b6 · nexlab / coderai

feat: smart context caching, VRAM offload fix, GPTQ/AWQ quant backend · 990f9471

Stefy Lanza (nextime / spora) authored Jun 18, 2026

Smart context caching (both text backends):
- Per-instance generation lock so pooled concurrent requests can't corrupt a
  shared KV cache (GGUF + HF, incl. streaming worker thread).
- GGUF: enable multi-slot LlamaRAMCache, budget via kv_cache_budget_mb (512MB).
- HF: replace single exact-text KV slot with an LRU of token-prefix slots +
  token-level longest-common-prefix + DynamicCache clone/crop (handles
  mid-history edits); kv_cache_slots (default 3).
- Session-affinity routing in ModelInstancePool.acquire(session_key); the key
  is derived from user/X-Session-Id, else from a stable prefix hash.
- RAM-pressure ladder drops reclaimable prefix caches before evicting models.
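
The HF-side prefix cache described above can be pictured roughly as follows. This is a minimal sketch, not the code in manager.py: PrefixSlotCache, common_prefix_len, lookup and store are hypothetical names, and the KV state is treated as an opaque object (in the real loader it would presumably be a DynamicCache that is cloned and cropped to the matched length before reuse).

```python
from collections import OrderedDict
from typing import Any, Optional, Sequence, Tuple


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token ids shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixSlotCache:
    """Hypothetical LRU of (token ids -> cached KV state) slots."""

    def __init__(self, max_slots: int = 3) -> None:
        self.max_slots = max_slots
        self._slots: "OrderedDict[Tuple[int, ...], Any]" = OrderedDict()

    def lookup(self, tokens: Sequence[int]) -> Tuple[int, Optional[Any]]:
        """Return (reusable_prefix_len, kv_state) for the best-matching slot."""
        best_key, best_len = None, 0
        for key in self._slots:
            n = common_prefix_len(key, tokens)
            if n > best_len:
                best_key, best_len = key, n
        if best_key is None:
            return 0, None
        self._slots.move_to_end(best_key)  # mark as most recently used
        # Assumption: the caller clones the state and crops it to best_len
        # tokens before reuse, so the stored slot is never mutated.
        return best_len, self._slots[best_key]

    def store(self, tokens: Sequence[int], kv_state: Any) -> None:
        """Insert or refresh a slot, evicting the least recently used one."""
        key = tuple(tokens)
        self._slots[key] = kv_state
        self._slots.move_to_end(key)
        while len(self._slots) > self.max_slots:
            self._slots.popitem(last=False)
```

Matching on token ids rather than on exact text is what lets a mid-history edit still reuse the shared leading tokens: only the part of the prompt after the first differing token has to be recomputed.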

VRAM fix:
- Auto-fit check no longer double-counts the KV/activation reserve when
  expected_vram_gb is already a peak estimate, so borderline models (e.g.
  gemma-4-26B-A4B) stay GPU-resident instead of being forced into
  MoE-thrashing device_map offload.
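
The double-counting can be illustrated with a small sketch. The function name fits_on_gpu, the expected_is_peak flag, the reserve_gb parameter and the numbers in the usage note are all assumptions made for illustration; only expected_vram_gb comes from the commit message.

```python
def fits_on_gpu(
    expected_vram_gb: float,
    free_vram_gb: float,
    reserve_gb: float,
    expected_is_peak: bool,
) -> bool:
    """Return True if the model is expected to fit without offload.

    When expected_vram_gb is already a peak estimate it includes the
    KV/activation headroom, so the reserve must not be added again.
    """
    if expected_is_peak:
        required = expected_vram_gb
    else:
        required = expected_vram_gb + reserve_gb
    return required <= free_vram_gb
```

With illustrative figures of 22 GB expected, 24 GB free and a 3 GB reserve, the check passes when the estimate is treated as a peak and fails when the reserve is added on top, which is the borderline case the fix targets.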

GPTQ/AWQ fast-kernel quant backend (HF path):
- New codai/models/quant.py: GPTQModel capability detection, quantized-checkpoint
  cache, on-demand background quantize job (falls back to bnb if unsupported).
- quant_backend config (auto|bnb|gptq|awq); loader auto-uses a quantized
  checkpoint with Marlin/ExLlama when present, else bitsandbytes.
- Admin endpoints + "Quantize to 4-bit" button with live status on the model page.
- requirements-nvidia.txt documents the from-source install + numpy caveat.
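
One way to read the quant_backend selection above is sketched below. This is a guess at the decision order, not the logic in codai/models/quant.py: resolve_quant_backend, checkpoint_format and kernels_supported are hypothetical names, and falling back to bnb when the requested format does not match the cached checkpoint is an assumption.

```python
from typing import Optional

VALID_BACKENDS = ("auto", "bnb", "gptq", "awq")


def resolve_quant_backend(
    requested: str,
    checkpoint_format: Optional[str],
    kernels_supported: bool,
) -> str:
    """Pick the backend the loader would use for a 4-bit load.

    requested:         value of the quant_backend config option.
    checkpoint_format: "gptq" or "awq" if a quantized checkpoint is cached,
                       else None.
    kernels_supported: whether fast kernels are usable on this machine.
    """
    if requested not in VALID_BACKENDS:
        raise ValueError(f"unknown quant_backend: {requested!r}")
    if requested == "bnb":
        return "bnb"
    # No cached checkpoint or no fast kernels: fall back to bitsandbytes.
    if checkpoint_format is None or not kernels_supported:
        return "bnb"
    if requested == "auto":
        return checkpoint_format
    # Assumption: an explicit request only applies if the cache matches it.
    return requested if checkpoint_format == requested else "bnb"
```

Under these assumptions, "auto" with a cached GPTQ checkpoint and working kernels resolves to "gptq", while any configuration without a usable checkpoint resolves to "bnb".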

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

