codai/backends · f97459fc5fd2c2635e352e3db6db4df87e9ea76a · nexlab / coderai

backend: per-model kv_offload flag to keep the KV cache in host RAM · 4754beff

Stefy Lanza (nextime / spora) authored Jun 20, 2026

Large contexts make the KV cache huge (a 256k-token q4_0 cache is several GB),
and it won't fit in VRAM alongside the weights. llama.cpp can't page the KV
cache to disk, but it can keep it in system RAM via --no-kv-offload. Expose that
as a per-model kv_offload flag (default unchanged: KV stays in VRAM). Setting
kv_offload=false passes offload_kqv=False to llama.cpp, freeing VRAM for big
contexts at the cost of slower decode (KV ops cross PCIe). Also allow the key in
the admin model-config endpoint so it can be persisted from the UI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
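
The trade-off described in the message can be sketched as follows. This is a minimal illustration, not the code from this commit: the helper names `estimate_kv_cache_bytes` and `build_llama_kwargs`, the shape of the model-config dict, and the example model dimensions are all assumptions. Only the `kv_offload` key and the `offload_kqv` keyword come from the commit message; the q4_0 size of 18 bytes per 32-value block is the standard ggml block layout.

```python
from typing import Any, Dict

# q4_0 packs 32 values into an 18-byte block
# (one fp16 scale + 16 bytes of 4-bit quants).
Q4_0_BYTES_PER_ELEMENT = 18 / 32


def estimate_kv_cache_bytes(
    n_ctx: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_element: float = Q4_0_BYTES_PER_ELEMENT,
) -> int:
    """Rough KV cache size: one K and one V tensor per layer."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return int(elements * bytes_per_element)


def build_llama_kwargs(model_config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical mapping from a per-model config to Llama(...) kwargs.

    A missing kv_offload key keeps the existing behavior (KV in VRAM);
    kv_offload=False becomes offload_kqv=False (KV in host RAM).
    """
    kwargs: Dict[str, Any] = {}
    if "n_ctx" in model_config:
        kwargs["n_ctx"] = int(model_config["n_ctx"])
    kwargs["offload_kqv"] = bool(model_config.get("kv_offload", True))
    return kwargs


if __name__ == "__main__":
    # Illustrative dimensions only: 32 layers, 8 KV heads, head_dim 128.
    size = estimate_kv_cache_bytes(262144, 32, 8, 128)
    print(f"estimated KV cache: {size / 2**30:.1f} GiB")
    print(build_llama_kwargs({"n_ctx": 262144, "kv_offload": False}))
```

With the illustrative dimensions above, a 256k-token q4_0 cache works out to about 9 GiB, in line with the "several GB" figure in the message. The real number depends on the model's layer count, KV head count, and head size.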

Name
__init__.py
base.py
cuda.py
ds4.py
vulkan.py