codai · 4754beff012b9f52ec3d4635e7e73d1af9c2105a · nexlab / coderai

backend: per-model kv_offload flag to keep the KV cache in host RAM · 4754beff

Stefy Lanza (nextime / spora) authored Jun 20, 2026

Large contexts make the KV cache huge (a 256k q4_0 cache is several GB), so it
may not fit in VRAM alongside the weights. llama.cpp can't page the KV cache to
disk, but it can keep it in system RAM via --no-kv-offload. Expose that as a
per-model kv_offload flag. The default is unchanged (KV cache in VRAM). Setting
kv_offload=false passes offload_kqv=False to llama.cpp, which frees VRAM for
big contexts at the cost of slower decode, because KV operations then cross
PCIe. Also allow the key in the admin model-config endpoint so it can be
persisted from the UI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
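
The sketch below illustrates the two ideas in this message. It is not the code from this commit. The function names, the config dict layout, and the example model dimensions are hypothetical. It assumes the backend loads models through llama-cpp-python, whose Llama constructor accepts an offload_kqv argument, and it approximates q4_0 storage as 18 bytes per block of 32 values.

```python
# Hypothetical sketch: names and config layout are assumptions, not coderai's code.

# q4_0 stores 32 values in an 18-byte block (16 bytes of 4-bit data + 2-byte scale).
Q4_0_BYTES_PER_ELEM = 18 / 32


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Rough KV cache size: one K and one V tensor per layer, per token."""
    return int(2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem)


def build_llama_kwargs(model_cfg):
    """Translate a per-model config dict into loader keyword arguments.

    kv_offload defaults to True, so models that do not set the key keep
    their KV cache in VRAM exactly as before.
    """
    kv_offload = model_cfg.get("kv_offload", True)
    if not isinstance(kv_offload, bool):
        raise TypeError("kv_offload must be a boolean")
    return {
        "n_ctx": model_cfg.get("n_ctx", 4096),
        # False keeps the KV cache in host RAM instead of VRAM.
        "offload_kqv": kv_offload,
    }


if __name__ == "__main__":
    # Example dimensions for an imaginary model, chosen only for illustration.
    size = kv_cache_bytes(
        n_ctx=262144, n_layers=32, n_kv_heads=8, head_dim=128,
        bytes_per_elem=Q4_0_BYTES_PER_ELEM,
    )
    print(f"KV cache: {size / 2**30:.1f} GiB")
    print(build_llama_kwargs({"n_ctx": 262144, "kv_offload": False}))
```

With these made-up dimensions the estimate comes to 9 GiB, which is in line with the "several GB" figure above. The returned dict would be passed on to the model loader, for example as Llama(model_path=..., **kwargs) under the llama-cpp-python assumption.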

Name
admin
api
backends
broker
frontproxy
models
openai
pydantic
queue
tasks
__init__.py
cli.py
config.py
main.py
platform_paths.py