    backend: per-model kv_offload flag to keep the KV cache in host RAM · 4754beff
    Stefy Lanza (nextime / spora ) authored
    Large contexts make the KV cache huge (a 256k q4_0 cache is several GB), often
    too large to fit in VRAM alongside the weights. llama.cpp can't page KV to disk, but it
    can keep it in system RAM via --no-kv-offload. Expose that as a per-model
    kv_offload flag (default unchanged = KV in VRAM): set kv_offload=false to pass
    offload_kqv=False to llama.cpp, freeing VRAM for big contexts at the cost of
    slower decode (KV ops cross PCIe). Also allow the key in the admin model-config
    endpoint so it's persistable from the UI.
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
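
The flag mapping described in the commit message could look roughly like the sketch below. It is not the code from this commit: `build_llama_kwargs` and the shape of the model config dict are hypothetical names chosen for illustration. The only parts taken from the message are the `kv_offload` key and the `offload_kqv` argument it maps to. The sketch assumes the key is a real boolean and that the loader argument is left untouched when the key is absent, so existing models keep the KV cache in VRAM.

```python
from typing import Any


def build_llama_kwargs(model_cfg: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: turn a per-model config into loader kwargs."""
    kwargs: dict[str, Any] = {}
    # Default unchanged: only an explicit boolean False moves the KV cache
    # to host RAM. A missing key or True leaves the loader default alone.
    if model_cfg.get("kv_offload", True) is False:
        kwargs["offload_kqv"] = False
    return kwargs


# Usage: a model configured for a very large context.
big_ctx_model = {"n_ctx": 262144, "kv_offload": False}
print(build_llama_kwargs(big_ctx_model))
```

Passing the argument only when the flag is explicitly false keeps the change invisible to every model that does not opt in, at the cost of relying on the loader's own default for the common case.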
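
To make the "several GB" figure concrete, here is a back-of-the-envelope estimate of KV cache size. The model shape (32 layers, 8 KV heads, head dimension 128) is an assumption picked for illustration, not a model named in this repository. The q4_0 cost of 18 bytes per block of 32 values comes from the ggml block layout (16 bytes of 4-bit quants plus a 2-byte fp16 scale).

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: float) -> int:
    """Estimate KV cache size: K and V each hold one vector per token per layer."""
    return int(2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem)


# q4_0 stores 32 values in an 18-byte block.
Q4_0_BYTES_PER_ELEM = 18 / 32

# Assumed model shape, for illustration only.
size = kv_cache_bytes(
    n_ctx=256 * 1024,
    n_layers=32,
    n_kv_heads=8,
    head_dim=128,
    bytes_per_elem=Q4_0_BYTES_PER_ELEM,
)
print(f"{size / 2**30:.1f} GiB")  # prints "9.0 GiB"
```

For this assumed shape the cache alone would take about 9 GiB, which illustrates why keeping it in host RAM can be worth the slower decode.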
Name
..
__init__.py
base.py
cuda.py
ds4.py
vulkan.py