-
Stefy Lanza (nextime / spora ) authored
Smart context caching (both text backends): - Per-instance generation lock so pooled concurrent requests can't corrupt a shared KV cache (GGUF + HF, incl. streaming worker thread). - GGUF: enable multi-slot LlamaRAMCache, budget via kv_cache_budget_mb (512MB). - HF: replace single exact-text KV slot with an LRU of token-prefix slots + token-level longest-common-prefix + DynamicCache clone/crop (handles mid-history edits); kv_cache_slots (default 3). - Session-affinity routing in ModelInstancePool.acquire(session_key); key from user/X-Session-Id else a stable prefix hash. - RAM-pressure ladder drops reclaimable prefix caches before evicting models. VRAM fix: - Auto-fit check no longer double-counts the KV/activation reserve when expected_vram_gb is already a peak estimate — borderline models (e.g. gemma-4-26B-A4B) stay GPU-resident instead of forced into MoE-thrashing device_map offload. GPTQ/AWQ fast-kernel quant backend (HF path): - New codai/models/quant.py: GPTQModel capability detection, quantized-checkpoint cache, on-demand background quantize job (falls back to bnb if unsupported). - quant_backend config (auto|bnb|gptq|awq); loader auto-uses a quantized checkpoint with Marlin/ExLlama when present, else bitsandbytes. - Admin endpoints + "Quantize to 4-bit" button with live status on the model page. - requirements-nvidia.txt documents the from-source install + numpy caveat. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
990f9471