    ds4: auto-downloaded weights land in coderai GGUF cache + show on models page · ef106ba1
    Stefy Lanza (nextime / spora ) authored
    When ds4.auto_download is enabled and a deepseek4 request resolves no local
    GGUF, the downloaded weight variant is now relocated into coderai's GGUF cache
    (get_model_cache_dir; move on same FS, symlink across devices) and registered
    in models.json as a text_models entry that mimics the requested ("failed")
    model's config — backend auto, on-request, enabled and visible (removed from
    unloaded/to_download). model_name is threaded through ds4 backend →
    ensure_service → ensure_model so the registration mirrors the right entry.
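
    The relocate-and-register flow described above can be sketched roughly as
    follows. This is an illustration only, not the committed code: the helper
    names relocate_into_cache and register_downloaded, the "name", "path" and
    "load_mode" keys, and the list shape of "unloaded"/"to_download" are all
    assumptions; only text_models, backend auto, on-request and enabled come
    from the description above.

```python
import copy
import os
from pathlib import Path


def relocate_into_cache(src: Path, cache_dir: Path) -> Path:
    """Hypothetical helper: move src into cache_dir when both are on the same
    filesystem, otherwise leave it in place and symlink to it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    dest = cache_dir / src.name
    if dest.exists() or dest.is_symlink():
        return dest  # already present in the cache
    if os.stat(src).st_dev == os.stat(cache_dir).st_dev:
        os.replace(src, dest)  # same device: plain rename, no copy
    else:
        os.symlink(src.resolve(), dest)  # other device: link, do not copy
    return dest


def register_downloaded(models: dict, failed_name: str, new_name: str,
                        gguf_path: Path) -> dict:
    """Hypothetical helper: add a text_models entry that copies the config of
    the requested ("failed") model. All key names here are assumed."""
    text_models = models.setdefault("text_models", [])
    template = next((m for m in text_models if m.get("name") == failed_name), {})
    entry = copy.deepcopy(template)
    entry.update(name=new_name, path=str(gguf_path), backend="auto",
                 load_mode="on-request", enabled=True)
    # Replace any stale entry with the same name, then append the new one.
    text_models[:] = [m for m in text_models if m.get("name") != new_name]
    text_models.append(entry)
    # Make the model visible by dropping it from the pending lists.
    for key in ("unloaded", "to_download"):
        if new_name in models.get(key, []):
            models[key].remove(new_name)
    return entry
```

    Checking st_dev before renaming is one way to decide between the two
    cases; catching the cross-device error from the rename would be another.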
    
    Also: settings "Extra ds4-server args" hint/placeholder updated to reflect the
    auto --kv-disk-dir and SSD-streaming expert-cache sizing
    (--ssd-streaming-cache-experts), noting Q2_K can fail ds4's CUDA prefill.
    
    Diagnosis (no code change): ds4-server's "cuda prefill failed" on the 93GB
    Q2_K variant is a quant-specific ds4 CUDA bug — the 154GB Q4_K completes
    prefill fine (verified: "prompt done 434s" vs Q2_K instant failure), with
    15.8GB VRAM free either way (not OOM, not cache budget, not coderai).
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
settings.html 45.9 KB