Stefy Lanza (nextime / spora ) authored
When ds4.auto_download is enabled and a deepseek4 request resolves no local GGUF, the downloaded weight variant is now relocated into coderai's GGUF cache (get_model_cache_dir; move on the same filesystem, symlink across devices) and registered in models.json as a text_models entry that mimics the config of the requested ("failed") model: backend auto, on-request, enabled and visible (removed from unloaded/to_download). model_name is threaded ds4 backend → ensure_service → ensure_model so the registration mirrors the right entry.

Also: the settings "Extra ds4-server args" hint/placeholder is updated to reflect the auto --kv-disk-dir and SSD-streaming expert-cache sizing (--ssd-streaming-cache-experts), noting that Q2_K can fail ds4's CUDA prefill.

Diagnosis (no code change): ds4-server's "cuda prefill failed" on the 93GB Q2_K variant is a quant-specific ds4 CUDA bug. The 154GB Q4_K completes prefill fine (verified: "prompt done 434s" vs. instant failure on Q2_K), with 15.8GB VRAM free either way (not OOM, not cache budget, not coderai).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
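
Note on the relocation step ("move on same FS, symlink across devices"): the following is a minimal sketch of that pattern, not the code in this commit. The function name relocate_into_cache and its parameters are hypothetical; the only assumption about coderai is that the cache directory comes from something like get_model_cache_dir.

```python
import errno
import os
from pathlib import Path


def relocate_into_cache(src: Path, cache_dir: Path) -> Path:
    """Place src under cache_dir: rename on the same filesystem, symlink otherwise.

    Hypothetical helper for illustration; names are not taken from coderai.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    dest = cache_dir / src.name
    if dest.exists() or dest.is_symlink():
        return dest  # already relocated (or a link is present); leave it alone
    try:
        # Cheap rename, but it only works within a single filesystem.
        os.rename(src, dest)
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        # Cross-device: leave the large weight file where it is and link to it
        # rather than copying many gigabytes.
        os.symlink(src.resolve(), dest)
    return dest
```

Linking instead of copying on EXDEV avoids duplicating a 90GB+ file, at the cost that the cache entry breaks if the original download location is later cleaned up.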
ef106ba1