    backend: per-model kv_offload flag to keep the KV cache in host RAM · 4754beff
    Stefy Lanza (nextime / spora ) authored
    Large contexts make the KV cache huge (a 256k q4_0 cache is several GB), which
    won't fit in VRAM alongside the weights. llama.cpp can't page KV to disk, but it
    can keep it in system RAM via --no-kv-offload. Expose that as a per-model
    kv_offload flag (default unchanged = KV in VRAM): set kv_offload=false to pass
    offload_kqv=False to llama.cpp, freeing VRAM for big contexts at the cost of
    slower decode (KV ops cross PCIe). Also allow the key in the admin model-config
    endpoint so it's persistable from the UI.
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
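    As a rough illustration of the flag described in the commit message, the sketch below maps a per-model kv_offload setting onto the offload_kqv keyword argument of llama-cpp-python's Llama constructor. The helper name build_llama_kwargs and the "path" and "n_ctx" config keys are assumptions made for this example; the repository's actual backend code may be structured differently.

```python
from typing import Any, Dict


def build_llama_kwargs(model_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a per-model config dict into Llama(...) keyword arguments.

    Hypothetical helper: the function name and the "path" / "n_ctx" keys are
    assumptions for illustration; only "kv_offload" comes from the commit text.
    """
    kwargs: Dict[str, Any] = {
        "model_path": model_cfg["path"],
        "n_ctx": model_cfg.get("n_ctx", 4096),
    }
    # Default unchanged: the KV cache stays in VRAM unless the flag is
    # explicitly set to False, in which case offload_kqv=False is passed on.
    if model_cfg.get("kv_offload", True) is False:
        kwargs["offload_kqv"] = False
    return kwargs


# A model without the flag keeps the library default (KV cache in VRAM).
default_kwargs = build_llama_kwargs({"path": "model.gguf"})

# A large-context model that trades decode speed for free VRAM.
host_kv_kwargs = build_llama_kwargs(
    {"path": "model.gguf", "n_ctx": 262144, "kv_offload": False}
)
```

    Leaving offload_kqv out of the kwargs when the flag is absent or true keeps the library default, which matches the "default unchanged" wording of the commit.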
Name
admin
api
backends
broker
frontproxy
models
openai
pydantic
queue
tasks
__init__.py
cli.py
config.py
main.py
platform_paths.py