    feat: smart context caching, VRAM offload fix, GPTQ/AWQ quant backend · 990f9471
    Stefy Lanza (nextime / spora ) authored
    Smart context caching (both text backends):
    - Per-instance generation lock so pooled concurrent requests can't corrupt a
      shared KV cache (GGUF + HF, incl. streaming worker thread).
    - GGUF: enable multi-slot LlamaRAMCache, budget via kv_cache_budget_mb (512MB).
    - HF: replace single exact-text KV slot with an LRU of token-prefix slots +
      token-level longest-common-prefix + DynamicCache clone/crop (handles
      mid-history edits); kv_cache_slots (default 3).
    - Session-affinity routing in ModelInstancePool.acquire(session_key); key from
      user/X-Session-Id else a stable prefix hash.
    - RAM-pressure ladder drops reclaimable prefix caches before evicting models.
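
The HF prefix-slot idea above can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: `PrefixSlotLRU`, `common_prefix_len`, `put` and `best_match` are hypothetical names, and the cached state is an opaque object standing in for a KV cache.

```python
from collections import OrderedDict
from typing import Generic, Optional, Sequence, Tuple, TypeVar

T = TypeVar("T")


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Return how many leading token ids a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixSlotLRU(Generic[T]):
    """Hypothetical LRU of token-prefix slots (token ids -> cached state)."""

    def __init__(self, max_slots: int = 3) -> None:
        self.max_slots = max_slots
        self._slots: "OrderedDict[Tuple[int, ...], T]" = OrderedDict()

    def put(self, tokens: Sequence[int], state: T) -> None:
        """Store a slot as most recently used, evicting the oldest if full."""
        key = tuple(tokens)
        self._slots[key] = state
        self._slots.move_to_end(key)
        while len(self._slots) > self.max_slots:
            self._slots.popitem(last=False)

    def best_match(self, tokens: Sequence[int]) -> Tuple[int, Optional[T]]:
        """Return (shared prefix length, state) for the best-matching slot."""
        best_key, best_len = None, 0
        for key in self._slots:
            n = common_prefix_len(key, tokens)
            if n > best_len:
                best_key, best_len = key, n
        if best_key is None:
            return 0, None
        # Mark the winning slot as recently used.
        self._slots.move_to_end(best_key)
        return best_len, self._slots[best_key]
```

A caller would presumably copy the returned state and trim it to the shared prefix length before generating, so that a mid-history edit only invalidates the tokens after the edit point.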
    
    VRAM fix:
    - Auto-fit check no longer double-counts the KV/activation reserve when
      expected_vram_gb is already a peak estimate — borderline models (e.g.
      gemma-4-26B-A4B) stay GPU-resident instead of forced into MoE-thrashing
      device_map offload.
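
The double-counting fix can be illustrated with a small sketch. Only `expected_vram_gb` comes from this commit; `fits_on_gpu`, `free_vram_gb`, `reserve_gb` and `is_peak_estimate` are assumed names, and the real check in the loader may differ.

```python
def fits_on_gpu(
    expected_vram_gb: float,
    free_vram_gb: float,
    reserve_gb: float = 2.0,
    is_peak_estimate: bool = True,
) -> bool:
    """Decide whether a model can stay fully GPU-resident.

    If expected_vram_gb is already a peak estimate, it includes the
    KV/activation headroom, so the reserve must not be added again.
    """
    if is_peak_estimate:
        required = expected_vram_gb
    else:
        required = expected_vram_gb + reserve_gb
    return required <= free_vram_gb
```

With a 22 GB peak estimate and 23.5 GB free, adding a 2 GB reserve on top would push the requirement to 24 GB and force offload, while the peak-aware check keeps the model on the GPU.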
    
    GPTQ/AWQ fast-kernel quant backend (HF path):
    - New codai/models/quant.py: GPTQModel capability detection, quantized-checkpoint
      cache, on-demand background quantize job (falls back to bnb if unsupported).
    - quant_backend config (auto|bnb|gptq|awq); loader auto-uses a quantized
      checkpoint with Marlin/ExLlama when present, else bitsandbytes.
    - Admin endpoints + "Quantize to 4-bit" button with live status on the model page.
    - requirements-nvidia.txt documents the from-source install + numpy caveat.
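
One way the `quant_backend` setting could resolve to a concrete backend is sketched below. The function name, the `available_checkpoints` argument, the preference for GPTQ over AWQ under `auto`, and the fallback to bnb when an explicitly requested checkpoint is missing are all assumptions made for illustration; they are not taken from `codai/models/quant.py`.

```python
from typing import Set

VALID_BACKENDS = ("auto", "bnb", "gptq", "awq")


def resolve_quant_backend(quant_backend: str, available_checkpoints: Set[str]) -> str:
    """Pick the backend to load with, given which quantized checkpoints exist.

    available_checkpoints holds the formats ("gptq", "awq") for which a
    quantized checkpoint is already cached (hypothetical representation).
    """
    if quant_backend not in VALID_BACKENDS:
        raise ValueError(f"unknown quant_backend: {quant_backend!r}")
    if quant_backend == "bnb":
        return "bnb"
    if quant_backend == "auto":
        # Assumed preference order: fast-kernel checkpoints first.
        for candidate in ("gptq", "awq"):
            if candidate in available_checkpoints:
                return candidate
        return "bnb"
    # Explicit gptq/awq: use it if a checkpoint exists, else fall back (assumed).
    return quant_backend if quant_backend in available_checkpoints else "bnb"
```

For example, `resolve_quant_backend("auto", {"awq"})` returns `"awq"`, and the same call with an empty set returns `"bnb"`.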
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
manager.py 173 KB