Files · 990f9471fb92f898f48f3579d0c4cbb65e036dce · nexlab / coderai

feat: smart context caching, VRAM offload fix, GPTQ/AWQ quant backend · 990f9471

Stefy Lanza (nextime / spora) authored Jun 18, 2026

Smart context caching (both text backends):
- Per-instance generation lock so pooled concurrent requests can't corrupt a
  shared KV cache (GGUF + HF, incl. streaming worker thread).
- GGUF: enable multi-slot LlamaRAMCache, budget via kv_cache_budget_mb (512MB).
- HF: replace single exact-text KV slot with an LRU of token-prefix slots +
  token-level longest-common-prefix + DynamicCache clone/crop (handles
  mid-history edits); kv_cache_slots (default 3).
- Session-affinity routing in ModelInstancePool.acquire(session_key); key from
  user/X-Session-Id else a stable prefix hash.
- RAM-pressure ladder drops reclaimable prefix caches before evicting models.
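
The token-prefix slot idea from the HF bullet above can be sketched roughly as follows. This is an illustrative sketch, not the code from this commit: `PrefixSlotCache`, `common_prefix_len`, and `best_match` are hypothetical names, and the KV object is treated as opaque (the commit describes cloning and cropping a DynamicCache to the matched length, which is left out here).

```python
from collections import OrderedDict
from typing import Any, Optional, Sequence, Tuple


def common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading token ids shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class PrefixSlotCache:
    """LRU of (token ids -> opaque KV object) slots. Hypothetical helper."""

    def __init__(self, max_slots: int = 3) -> None:
        self.max_slots = max_slots
        self._slots: "OrderedDict[Tuple[int, ...], Any]" = OrderedDict()

    def put(self, tokens: Sequence[int], kv: Any) -> None:
        key = tuple(tokens)
        self._slots[key] = kv
        self._slots.move_to_end(key)         # mark as most recently used
        while len(self._slots) > self.max_slots:
            self._slots.popitem(last=False)  # evict the least recently used slot

    def best_match(self, tokens: Sequence[int]) -> Tuple[int, Optional[Any]]:
        """Return (reusable prefix length, kv) for the slot sharing the longest prefix."""
        best_key, best_len = None, 0
        for key in self._slots:
            n = common_prefix_len(key, tokens)
            if n > best_len:
                best_key, best_len = key, n
        if best_key is None:
            return 0, None
        self._slots.move_to_end(best_key)    # a hit refreshes the slot's recency
        return best_len, self._slots[best_key]
```

Because matching is done on token ids rather than exact text, a request whose history was edited midway can still reuse the part of a cached prefix that precedes the edit; the caller would then recompute only the tokens after the matched length.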

VRAM fix:
- Auto-fit check no longer double-counts the KV/activation reserve when
  expected_vram_gb is already a peak estimate — borderline models (e.g.
  gemma-4-26B-A4B) stay GPU-resident instead of forced into MoE-thrashing
  device_map offload.
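
A minimal sketch of the double-counting issue described above, assuming a simple "required <= free" fit check. The function name, the `expected_is_peak` flag, and the numbers are hypothetical and chosen only for illustration; the actual check in the loader may be structured differently.

```python
def fits_on_gpu(
    expected_vram_gb: float,
    free_vram_gb: float,
    reserve_gb: float = 2.0,
    expected_is_peak: bool = False,
) -> bool:
    """Return True if the model is expected to fit entirely in GPU memory.

    When expected_vram_gb is already a peak estimate (weights plus KV cache
    and activations), adding reserve_gb again would count that reserve twice.
    """
    if expected_is_peak:
        required = expected_vram_gb
    else:
        required = expected_vram_gb + reserve_gb
    return required <= free_vram_gb
```

With an illustrative 23 GB peak estimate on a GPU with 24 GB free, the check passes when the estimate is treated as a peak and fails when the 2 GB reserve is added on top, which is the kind of borderline case that would otherwise be pushed into device_map offload.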

GPTQ/AWQ fast-kernel quant backend (HF path):
- New codai/models/quant.py: GPTQModel capability detection, quantized-checkpoint
  cache, on-demand background quantize job (falls back to bnb if unsupported).
- quant_backend config (auto|bnb|gptq|awq); loader auto-uses a quantized
  checkpoint with Marlin/ExLlama when present, else bitsandbytes.
- Admin endpoints + "Quantize to 4-bit" button with live status on the model page.
- requirements-nvidia.txt documents the from-source install + numpy caveat.
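
The backend selection described above might look roughly like this. It is a sketch under stated assumptions: only the `quant_backend` values (auto|bnb|gptq|awq) come from the commit message, while `choose_quant_backend`, `cached_format`, and `fast_kernels_ok` are hypothetical names, and the fallback rules are a guess at reasonable behavior rather than the logic in codai/models/quant.py.

```python
from typing import Optional

VALID_BACKENDS = ("auto", "bnb", "gptq", "awq")


def choose_quant_backend(
    quant_backend: str,
    cached_format: Optional[str],
    fast_kernels_ok: bool,
) -> str:
    """Pick the backend to load with.

    cached_format is "gptq" or "awq" when a quantized checkpoint is already
    cached for the model, or None when there is none (assumption).
    fast_kernels_ok says whether the fast-kernel library was detected.
    """
    if quant_backend not in VALID_BACKENDS:
        raise ValueError(f"unknown quant_backend: {quant_backend!r}")
    if cached_format not in (None, "gptq", "awq"):
        raise ValueError(f"unknown checkpoint format: {cached_format!r}")

    # Without a usable quantized checkpoint, fall back to bitsandbytes.
    if quant_backend == "bnb" or not fast_kernels_ok or cached_format is None:
        return "bnb"
    # "auto" takes whatever format is cached; an explicit choice must match it.
    if quant_backend == "auto" or quant_backend == cached_format:
        return cached_format
    return "bnb"
```

For example, `choose_quant_backend("auto", "gptq", True)` selects the cached GPTQ checkpoint, while the same call with no cached checkpoint falls back to bitsandbytes.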

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>


Name
codai
docs
packaging
samples
tests
tools
.dockerignore
.gitignore
AI.PROMPT
CODERAI_API_DOCUMENTATION.md
CoderAI.gif
DISTRIBUTION.md
LICENSE.md
MULTIMODAL_CAPABILITIES.md
MULTIMODAL_UI_EXAMPLES.md
README.md
build-oci.sh
build.ps1
build.sh
coderai
coderai-broker-implementation-reference.md
coderai-integration.md
commands
osxbuild.sh
package-oci.sh
package-tarball.sh
requirements-nvidia.txt
requirements-vulkan.txt
requirements.txt
run-oci.sh
smoke-test-oci.sh
todo.md
video_editor.config.json

README.md