Stefy Lanza (nextime / spora) authored
Large contexts make the KV cache huge (a 256k q4_0 cache is several GB), which won't fit in VRAM alongside the weights. llama.cpp can't page KV to disk, but it can keep it in system RAM via --no-kv-offload.

Expose that as a per-model kv_offload flag (default unchanged = KV in VRAM): set kv_offload=false to pass offload_kqv=False to llama.cpp, freeing VRAM for big contexts at the cost of slower decode (KV ops cross PCIe). Also allow the key in the admin model-config endpoint so it's persistable from the UI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
4754beff
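
The flag handling described in this commit could look roughly like the sketch below. It is not the code from the commit: the helper name `build_llama_kwargs` and the config keys `path`, `n_ctx` and `n_gpu_layers` are assumptions made for illustration. Only `kv_offload` and `offload_kqv` come from the message above, and the sketch assumes the model is loaded through llama-cpp-python's `Llama` constructor.

```python
from typing import Any, Dict


def build_llama_kwargs(model_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a per-model config dict into Llama(...) keyword arguments.

    Hypothetical helper: the key names other than kv_offload are assumed.
    """
    kwargs: Dict[str, Any] = {
        "model_path": model_cfg["path"],
        "n_ctx": int(model_cfg.get("n_ctx", 4096)),
        "n_gpu_layers": int(model_cfg.get("n_gpu_layers", -1)),
    }
    # Default is unchanged (KV cache in VRAM): only pass offload_kqv when the
    # model config explicitly sets kv_offload to false.
    if model_cfg.get("kv_offload", True) is False:
        kwargs["offload_kqv"] = False  # keep the KV cache in system RAM
    return kwargs


# Example: a large-context model that keeps its KV cache out of VRAM.
big_ctx_cfg = {"path": "model.gguf", "n_ctx": 262144, "kv_offload": False}
big_ctx_kwargs = build_llama_kwargs(big_ctx_cfg)
# The result would then be passed on as: llama_cpp.Llama(**big_ctx_kwargs)
```

Leaving `offload_kqv` out of the kwargs when the flag is absent or true means existing model configs behave exactly as before.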