    feat: smart context caching, VRAM offload fix, GPTQ/AWQ quant backend · 990f9471
    Stefy Lanza (nextime / spora ) authored
    Smart context caching (both text backends):
    - Per-instance generation lock so pooled concurrent requests can't corrupt a
      shared KV cache (GGUF + HF, incl. streaming worker thread).
    - GGUF: enable multi-slot LlamaRAMCache, budget via kv_cache_budget_mb (512MB).
    - HF: replace single exact-text KV slot with an LRU of token-prefix slots +
      token-level longest-common-prefix + DynamicCache clone/crop (handles
      mid-history edits); kv_cache_slots (default 3).
    - Session-affinity routing in ModelInstancePool.acquire(session_key); key from
      user/X-Session-Id else a stable prefix hash.
    - RAM-pressure ladder drops reclaimable prefix caches before evicting models.
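
The HF-side lookup described above (an LRU of token-prefix slots matched by token-level longest common prefix) might look roughly like the sketch below. It is not the code from this commit: `PrefixSlotCache`, `common_prefix_len`, `put` and `best_match` are hypothetical names, and the cached state is an opaque object standing in for a KV cache. In the real backend the caller would presumably clone the matched cache and crop it to the returned prefix length before generating.

```python
from collections import OrderedDict
from typing import Any, Optional, Sequence, Tuple


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Return how many leading token ids a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixSlotCache:
    """Small LRU of token-prefix slots (hypothetical illustration)."""

    def __init__(self, max_slots: int = 3) -> None:
        self.max_slots = max_slots
        # Oldest entry first, most recently used entry last.
        self._slots: "OrderedDict[Tuple[int, ...], Any]" = OrderedDict()

    def put(self, tokens: Sequence[int], state: Any) -> None:
        """Store state for a token sequence, evicting the least recently used slot."""
        key = tuple(tokens)
        self._slots[key] = state
        self._slots.move_to_end(key)
        while len(self._slots) > self.max_slots:
            self._slots.popitem(last=False)

    def best_match(self, tokens: Sequence[int]) -> Tuple[int, Optional[Any]]:
        """Return (shared prefix length, state) for the slot with the longest match."""
        best_key, best_len = None, 0
        for key in self._slots:
            n = common_prefix_len(key, tokens)
            if n > best_len:
                best_key, best_len = key, n
        if best_key is None:
            return 0, None
        # A hit refreshes the slot so it is evicted last.
        self._slots.move_to_end(best_key)
        return best_len, self._slots[best_key]
```

Because the match is on token ids rather than exact text, a prompt whose history was edited midway can still reuse the part of a slot that precedes the edit.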
    
    VRAM fix:
    - Auto-fit check no longer double-counts the KV/activation reserve when
      expected_vram_gb is already a peak estimate — borderline models (e.g.
      gemma-4-26B-A4B) stay GPU-resident instead of forced into MoE-thrashing
      device_map offload.
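
A minimal sketch of the corrected auto-fit check, under stated assumptions: `fits_on_gpu`, `reserve_gb` and `expected_is_peak` are hypothetical names, and the numbers in the usage lines are illustrative rather than measured. The point is only that the reserve is added when the estimate covers weights alone, and skipped when the estimate is already a peak figure.

```python
def fits_on_gpu(
    expected_vram_gb: float,
    free_vram_gb: float,
    reserve_gb: float = 2.0,
    expected_is_peak: bool = False,
) -> bool:
    """Decide whether a model can stay fully GPU-resident (hypothetical helper)."""
    if expected_is_peak:
        # The estimate already includes KV cache and activations,
        # so adding the reserve again would count it twice.
        required = expected_vram_gb
    else:
        required = expected_vram_gb + reserve_gb
    return required <= free_vram_gb


# Illustrative numbers only: a 22.5 GB peak estimate on a card with 24 GB free.
print(fits_on_gpu(22.5, 24.0, expected_is_peak=True))
print(fits_on_gpu(22.5, 24.0, expected_is_peak=False))
```

With the double count, a borderline model fails the check and is pushed to offload even though its peak usage fits.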
    
    GPTQ/AWQ fast-kernel quant backend (HF path):
    - New codai/models/quant.py: GPTQModel capability detection, quantized-checkpoint
      cache, on-demand background quantize job (falls back to bnb if unsupported).
    - quant_backend config (auto|bnb|gptq|awq); loader auto-uses a quantized
      checkpoint with Marlin/ExLlama when present, else bitsandbytes.
    - Admin endpoints + "Quantize to 4-bit" button with live status on the model page.
    - requirements-nvidia.txt documents the from-source install + numpy caveat.
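
The backend selection described above could be resolved along these lines. This is a sketch, not the contents of codai/models/quant.py: `resolve_quant_backend`, `checkpoint_format` and `fast_kernels_supported` are hypothetical names, and the rule that a mismatched explicit request falls back to bitsandbytes is an assumption.

```python
from typing import Optional

VALID_BACKENDS = ("auto", "bnb", "gptq", "awq")


def resolve_quant_backend(
    requested: str,
    checkpoint_format: Optional[str] = None,
    fast_kernels_supported: bool = False,
) -> str:
    """Pick the backend to load with (hypothetical helper).

    checkpoint_format is "gptq" or "awq" when a quantized checkpoint is
    cached for the model, otherwise None.
    """
    if requested not in VALID_BACKENDS:
        raise ValueError(f"unknown quant_backend: {requested!r}")
    if requested == "bnb":
        return "bnb"
    # Without fast kernels or a cached quantized checkpoint, use bitsandbytes.
    if not fast_kernels_supported or checkpoint_format is None:
        return "bnb"
    if requested == "auto":
        return checkpoint_format
    # Assumption: an explicit request that does not match the cached
    # checkpoint format falls back rather than failing.
    return requested if requested == checkpoint_format else "bnb"
```

Returning "bnb" on every unsupported path keeps the loader usable on machines where the fast kernels are not installed.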
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Name
codai
docs
packaging
samples
tests
tools
.dockerignore
.gitignore
AI.PROMPT
CODERAI_API_DOCUMENTATION.md
CoderAI.gif
DISTRIBUTION.md
LICENSE.md
MULTIMODAL_CAPABILITIES.md
MULTIMODAL_UI_EXAMPLES.md
README.md
build-oci.sh
build.ps1
build.sh
coderai
coderai-broker-implementation-reference.md
coderai-integration.md
commands
osxbuild.sh
package-oci.sh
package-tarball.sh
requirements-nvidia.txt
requirements-vulkan.txt
requirements.txt
run-oci.sh
smoke-test-oci.sh
todo.md
video_editor.config.json