    feat: smart context caching, VRAM offload fix, GPTQ/AWQ quant backend · 990f9471
    Stefy Lanza (nextime / spora ) authored
    Smart context caching (both text backends):
    - Per-instance generation lock so pooled concurrent requests can't corrupt a
      shared KV cache (GGUF + HF, incl. streaming worker thread).
    - GGUF: enable multi-slot LlamaRAMCache, budget via kv_cache_budget_mb (512MB).
    - HF: replace single exact-text KV slot with an LRU of token-prefix slots +
      token-level longest-common-prefix + DynamicCache clone/crop (handles
      mid-history edits); kv_cache_slots (default 3).
    - Session-affinity routing in ModelInstancePool.acquire(session_key); key from
      user/X-Session-Id else a stable prefix hash.
    - RAM-pressure ladder drops reclaimable prefix caches before evicting models.
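
The HF-side prefix cache described above can be illustrated roughly as follows. This is a minimal sketch, not the code from this commit: `PrefixSlotCache`, `common_prefix_len`, and the opaque `payload` are hypothetical names, and the payload stands in for whatever KV state the backend keeps. In the real backend the caller would presumably clone the matched cache and crop it to the matched length before reuse.

```python
from collections import OrderedDict


def common_prefix_len(a, b):
    """Return the number of leading token ids shared by sequences a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixSlotCache:
    """LRU of token-prefix slots (hypothetical sketch; default mirrors kv_cache_slots=3)."""

    def __init__(self, max_slots=3):
        self.max_slots = max_slots
        self._slots = OrderedDict()  # tuple(token ids) -> cached payload

    def lookup(self, tokens):
        """Find the slot with the longest common token prefix.

        Returns (matched_len, payload), or (0, None) when nothing matches.
        A hit marks the slot as most recently used.
        """
        best_key, best_len = None, 0
        for key in self._slots:
            n = common_prefix_len(key, tokens)
            if n > best_len:
                best_key, best_len = key, n
        if best_key is None:
            return 0, None
        self._slots.move_to_end(best_key)
        return best_len, self._slots[best_key]

    def store(self, tokens, payload):
        """Insert or refresh a slot, evicting the least recently used on overflow."""
        key = tuple(tokens)
        self._slots[key] = payload
        self._slots.move_to_end(key)
        while len(self._slots) > self.max_slots:
            self._slots.popitem(last=False)
```

Because matching is done on token ids rather than exact text, a request whose history was edited mid-conversation still reuses the cache up to the first differing token.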
    
    VRAM fix:
    - Auto-fit check no longer double-counts the KV/activation reserve when
      expected_vram_gb is already a peak estimate — borderline models (e.g.
      gemma-4-26B-A4B) stay GPU-resident instead of forced into MoE-thrashing
      device_map offload.
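
The double-counting described above can be sketched like this. It is only an illustration under stated assumptions: `fits_on_gpu`, `reserve_gb`, `expected_is_peak`, and all numbers are hypothetical, and the actual auto-fit logic in the loader may differ.

```python
def fits_on_gpu(expected_vram_gb, free_vram_gb, reserve_gb=2.0, expected_is_peak=True):
    """Decide whether a model can stay fully GPU-resident.

    When expected_vram_gb is already a peak estimate (weights plus KV cache
    and activations), adding reserve_gb again would count that headroom
    twice and push borderline models into offload.
    """
    if expected_is_peak:
        needed = expected_vram_gb
    else:
        # Weights-only estimate: add the KV/activation reserve exactly once.
        needed = expected_vram_gb + reserve_gb
    return needed <= free_vram_gb
```

With hypothetical figures of 22.5 GB expected and 24 GB free, the peak-estimate path keeps the model on the GPU, while treating the same figure as weights-only would not.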
    
    GPTQ/AWQ fast-kernel quant backend (HF path):
    - New codai/models/quant.py: GPTQModel capability detection, quantized-checkpoint
      cache, on-demand background quantize job (falls back to bnb if unsupported).
    - quant_backend config (auto|bnb|gptq|awq); loader auto-uses a quantized
      checkpoint with Marlin/ExLlama when present, else bitsandbytes.
    - Admin endpoints + "Quantize to 4-bit" button with live status on the model page.
    - requirements-nvidia.txt documents the from-source install + numpy caveat.
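
One way to picture the quant_backend resolution described above is the sketch below. The function name, its parameters, and the exact fallback rules are assumptions made for illustration; only the four config values and the "quantized checkpoint when present, else bitsandbytes" behaviour come from the commit message.

```python
VALID_BACKENDS = {"auto", "bnb", "gptq", "awq"}


def resolve_quant_backend(requested, checkpoint_format=None, kernels_ok=True):
    """Pick the backend to load with (hypothetical sketch).

    requested:          value of the quant_backend config option.
    checkpoint_format:  "gptq" or "awq" if a quantized checkpoint is cached, else None.
    kernels_ok:         whether the fast-kernel library was detected as usable.
    """
    if requested not in VALID_BACKENDS:
        raise ValueError(f"unknown quant_backend: {requested!r}")
    if requested == "bnb":
        return "bnb"
    # No usable kernels or no quantized checkpoint yet: fall back to bitsandbytes.
    if not kernels_ok or checkpoint_format is None:
        return "bnb"
    if requested == "auto":
        return checkpoint_format
    # An explicit gptq/awq request is honoured only if the checkpoint matches.
    return requested if checkpoint_format == requested else "bnb"
```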
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Name
admin
api
backends
broker
frontproxy
models
openai
pydantic
queue
tasks
__init__.py
cli.py
config.py
main.py
platform_paths.py