fix(ds4): give ds4 models exclusive VRAM (evict others) to stop expert-cache starvation

ds4-server streams MoE experts and wants the whole GPU for its expert cache, but
coderai's modest VRAM estimate for a ds4 model let another model co-reside on
the GPU. That starved the cache, so ds4's layer-0 FFN expert encode failed
("gpu layer 0 ffn batch encode failed"). When loading a ds4 model on demand,
unload all other models first so ds4-server gets the full GPU.
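
The eviction rule described above can be sketched roughly as follows. This is an illustration only, not the code in this commit: `ModelRegistry`, `load_on_demand`, `unload`, and the `"ds4"` backend tag are hypothetical names chosen for the example, and the real loader presumably also stops the evicted models' server processes.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRegistry:
    """Hypothetical stand-in for the loaded-model bookkeeping."""

    # Maps model name -> backend tag (e.g. "ds4"); names are assumptions.
    loaded: dict[str, str] = field(default_factory=dict)

    def unload(self, name: str) -> None:
        # A real implementation would also free the model's VRAM here.
        self.loaded.pop(name, None)

    def load_on_demand(self, name: str, backend: str) -> list[str]:
        """Load `name` and return the names evicted to make room."""
        evicted: list[str] = []
        if backend == "ds4":
            # ds4 gets exclusive VRAM: evict every other resident model first.
            evicted = [m for m in self.loaded if m != name]
            for m in evicted:
                self.unload(m)
        self.loaded[name] = backend
        return evicted
```

Non-ds4 loads are left unchanged in this sketch, so only a ds4 load triggers eviction.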

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>