feat(ds4): auto-route deepseek4 GGUFs by architecture; serve the requested file

- Route to ds4 by GGUF ARCHITECTURE (general.architecture == "deepseek4"), read from the file header (cached), not by filename. Mainline deepseek/2/3/32 GGUFs stay on llama.cpp; the model_id alias still routes for the download case.
- ds4-server now serves the REQUESTED GGUF: Ds4Backend resolves the model to a local .gguf and launches `ds4-server -m <file>` (resolve_service_key keys the managed service per file). No fixed-variant assumption.
- Honour the model's per-entry n_ctx for ds4-server --ctx (over the global ctx).
- New config.ds4 options + settings UI: ssd_streaming (--ssd-streaming, stream MoE experts from SSD/disk), model_path (explicit -m override), and auto_download (OFF by default: only serve GGUFs already present; error clearly instead of silently pulling tens of GB; opt in to fetch model_variant).
- AI.PROMPT: document DeepSeek-V4 = pending upstream llama.cpp PRs (needs new ggml ops) → ds4 for now; and ds4 routing/offload/text-only specifics.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
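
The model-selection order described in the message above (requested file, then the model_path override, then an opt-in download) can be sketched roughly as follows. This is a minimal illustration, not the code in this commit: `pick_gguf` and the temporary file names are hypothetical, and the real logic lives in `resolve_service_key` and `ensure_service`.

```python
import os
import tempfile
from typing import Optional, Tuple


def pick_gguf(model_file: Optional[str], model_path: str,
              auto_download: bool) -> Tuple[str, bool]:
    """Return (local_gguf_or_'', should_download).

    Hypothetical helper mirroring the documented preference order:
    requested model file, then explicit override, then opt-in download.
    """
    for cand in (model_file, model_path):
        cand = os.path.expanduser((cand or "").strip())
        if cand and cand.lower().endswith(".gguf") and os.path.isfile(cand):
            return cand, False          # a local file wins, nothing to download
    if auto_download:
        return "", True                 # opted in: fall back to the variant download
    # Default behaviour: fail loudly rather than pull tens of GB.
    raise RuntimeError("no local deepseek4 GGUF resolved and auto_download is off")


with tempfile.TemporaryDirectory() as tmp:
    requested = os.path.join(tmp, "requested.gguf")   # placeholder files
    override = os.path.join(tmp, "override.gguf")
    for p in (requested, override):
        open(p, "wb").close()

    # The requested file exists, so it is chosen over the override.
    picked_requested = pick_gguf(requested, override, auto_download=False)
    # The requested file is missing, so the override is used instead.
    picked_override = pick_gguf(os.path.join(tmp, "missing.gguf"), override, False)
    # Nothing local, but download is opted in.
    picked_download = pick_gguf(None, "", auto_download=True)
```

With nothing local and `auto_download` left off, the sketch raises instead of returning, which matches the "error clearly" behaviour the message describes.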

Commit 6a153c58 authored Jun 19, 2026 by Stefy Lanza (nextime / spora )
7 changed files
--- a/AI.PROMPT
+++ b/AI.PROMPT
@@ -361,3 +361,44 @@ Incremental update (FAST, ~30 s — code-only changes, NO bundle recopy):
  - CAUTION: COPY adds/overwrites but does NOT delete files removed from the
    repo; the cleanup RUN prunes only known-stale paths (.git/venv*/dist/...). A
    source file deleted from codai/ lingers in the overlay until a full rebuild.
+
+================================================================================
+## DeepSeek-V4: llama.cpp support is pending — ds4 for now
+================================================================================
+
+DeepSeek-V4 GGUFs use the `deepseek4` architecture (keys like
+`deepseek4.hash_layer_count`, `deepseek4.nextn_predict_layers`). The bundled
+llama.cpp (0.3.30, ggml 0.15.1) supports up to `deepseek32` (V3.2) — NOT
+`deepseek4`, so it fails fast with "Failed to load model from file" (this is an
+ARCHITECTURE gap, not VRAM/offload, and no quantization changes it).
+
+Mainline llama.cpp V4 support is IN PROGRESS but NOT yet merged — it needs new
+ggml ops (a "hyperconnection" op and `GGML_OP_LIGHTNING_INDEXER`) tracked in OPEN
+PRs ggml-org/llama.cpp#23122, #24162, #24231 (#22319 = the model request; #23706
+"can't launch deepseek-v4-flash" = the exact Flash variant). So:
+  - Rebuilding llama.cpp from `master` today will NOT run V4 (ops unmerged).
+  - TODO (revisit soon): once those PRs land, rebuild llama-cpp-python from a
+    tag/commit that includes them (CUDA build), and run V4 GGUFs converted with
+    MAINLINE's convert_hf_to_gguf (so the arch/tensors match) — any quant.
+  - The user's `…-ds4-…Q2_K.gguf` is ds4 (antirez/ds4 / DwarfStar) format; it
+    will only ever load under ds4, not llama.cpp.
+
+ds4 (codai/backends/ds4.py, config.ds4, codai/api/ds4_worker.py) is the native
+DeepSeek-V4 engine: coderai owns ds4-server (clone+build) and proxies to it.
+Routing & model selection (manager.ds4_should_handle):
+  - When ds4.enabled, a model routes to ds4 IFF its GGUF ARCHITECTURE is
+    `deepseek4` (read from the file header via manager._gguf_architecture, cached)
+    — NOT by filename. Mainline deepseek/2/3/32 GGUFs stay on llama.cpp. The
+    config.ds4.model_id alias also routes (for the downloaded-variant case).
+  - ds4-server serves the REQUESTED GGUF: Ds4Backend resolves the model to a local
+    .gguf and launches `ds4-server -m <that file>`. So any deepseek4 GGUF you have
+    is served directly — no per-model setup.
+  - `--ctx` honours the model's per-entry n_ctx (forwarded via kwargs), not just
+    config.ds4.ctx.
+  - config.ds4.ssd_streaming → `--ssd-streaming` (stream MoE experts from SSD/disk;
+    run a 100GB+ model on a small GPU). config.ds4.model_path = explicit -m override.
+  - config.ds4.auto_download is OFF by default: ds4 only serves GGUFs you already
+    have; with no local file it errors (clear message) instead of pulling tens of
+    GB. Enable auto_download to fetch config.ds4.model_variant as a fallback.
+  - ds4-server exposes only TEXT APIs (chat/completions/responses/anthropic) — no
+    image generation; that needs a separate diffusion model.
--- a/codai/admin/routes.py
+++ b/codai/admin/routes.py
@@ -3163,11 +3163,14 @@ async def api_get_settings(username: str = Depends(require_admin)):
            "repo_url": c.ds4.repo_url,
            "install_dir": c.ds4.install_dir,
            "build_target": c.ds4.build_target,
+            "model_path": c.ds4.model_path,
+            "auto_download": c.ds4.auto_download,
            "model_variant": c.ds4.model_variant,
            "model_id": c.ds4.model_id,
            "host": c.ds4.host,
            "port": c.ds4.port,
            "ctx": c.ds4.ctx,
+            "ssd_streaming": c.ds4.ssd_streaming,
            "extra_args": c.ds4.extra_args,
            "auto_build": c.ds4.auto_build,
        },
@@ -3431,6 +3434,12 @@ async def api_save_settings(request: Request, username: str = Depends(require_ad
            c.ds4.install_dir = (d.get("install_dir") or "").strip() or None
        if "build_target" in d:
            c.ds4.build_target = (d.get("build_target") or "auto").strip()
+        if "model_path" in d:
+            c.ds4.model_path = (d.get("model_path") or "").strip()
+        if "auto_download" in d:
+            c.ds4.auto_download = bool(d["auto_download"])
+        if "ssd_streaming" in d:
+            c.ds4.ssd_streaming = bool(d["ssd_streaming"])
        if "model_variant" in d:
            c.ds4.model_variant = (d.get("model_variant") or c.ds4.model_variant).strip()
        if "model_id" in d:

--- a/codai/admin/templates/settings.html
+++ b/codai/admin/templates/settings.html
@@ -394,18 +394,36 @@
      <div class="form-row" style="margin:0">
        <label class="form-label">Model id / alias</label>
        <input type="text" id="s-ds4-model-id" class="form-input" placeholder="deepseek-v4">
-        <span class="form-hint">Requests for this id (or any name containing "deepseek-v4") route to ds4.</span>
+        <span class="form-hint">When enabled, ds4 automatically serves any <b>deepseek4</b>-architecture GGUF you have (detected by reading the file); mainline DeepSeek GGUFs stay on llama.cpp. This alias also routes (for ds4's own downloaded variant).</span>
      </div>
      <div class="form-row" style="margin:0">
-        <label class="form-label">Weight variant</label>
+        <label class="form-label">Weight variant <span class="muted">(download fallback)</span></label>
        <select id="s-ds4-variant" class="form-input">
          <option value="q2-imatrix">q2-imatrix (96/128 GB)</option>
          <option value="q2-q4-imatrix">q2-q4-imatrix (96/128 GB)</option>
          <option value="q4-imatrix">q4-imatrix (256 GB+)</option>
          <option value="pro-q2-imatrix">pro-q2-imatrix (512 GB)</option>
        </select>
+        <span class="form-hint">Only used when auto-download is enabled below.</span>
      </div>
    </div>
+    <div class="form-row">
+      <label style="display:flex;align-items:center;gap:.5rem;cursor:pointer">
+        <input type="checkbox" id="s-ds4-auto-download">
+        <span>Auto-download the weight variant <span class="muted">— OFF by default; when off, ds4 only serves deepseek4 GGUFs you already have and errors if none resolves (never pulls tens of GB silently)</span></span>
+      </label>
+    </div>
+    <div class="form-row">
+      <label class="form-label">Model GGUF path <span class="muted">(optional override)</span></label>
+      <input type="text" id="s-ds4-model-path" class="form-input" placeholder="/AI/guffcache/…-ds4-Q2_K.gguf">
+      <span class="form-hint">Force ds4-server to load this exact GGUF (skips the variant download). Leave blank to auto-serve the requested deepseek4 model.</span>
+    </div>
+    <div class="form-row">
+      <label style="display:flex;align-items:center;gap:.5rem;cursor:pointer">
+        <input type="checkbox" id="s-ds4-ssd-streaming">
+        <span>SSD streaming <span class="muted">— stream MoE experts from disk/SSD (run a 100GB+ model on a small GPU + modest RAM; slower)</span></span>
+      </label>
+    </div>
    <div style="display:grid;grid-template-columns:1fr 1fr;gap:1rem;align-items:start">
      <div class="form-row" style="margin:0">
        <label class="form-label">Build target</label>
@@ -656,6 +674,9 @@ async function loadSettings(){
    document.getElementById('s-ds4-enabled').checked = !!ds4.enabled;
    document.getElementById('s-ds4-model-id').value = ds4.model_id ?? 'deepseek-v4';
    document.getElementById('s-ds4-variant').value = ds4.model_variant ?? 'q4-imatrix';
+    document.getElementById('s-ds4-model-path').value = ds4.model_path ?? '';
+    document.getElementById('s-ds4-auto-download').checked = !!ds4.auto_download;
+    document.getElementById('s-ds4-ssd-streaming').checked = !!ds4.ssd_streaming;
    document.getElementById('s-ds4-build-target').value = ds4.build_target ?? 'auto';
    document.getElementById('s-ds4-install-dir').value = ds4.install_dir ?? '';
    document.getElementById('s-ds4-auto-build').checked = ds4.auto_build !== false;
@@ -728,6 +749,9 @@ async function saveSettings(){
      enabled: document.getElementById('s-ds4-enabled').checked,
      model_id: document.getElementById('s-ds4-model-id').value.trim() || 'deepseek-v4',
      model_variant: document.getElementById('s-ds4-variant').value,
+      model_path: document.getElementById('s-ds4-model-path').value.trim(),
+      auto_download: document.getElementById('s-ds4-auto-download').checked,
+      ssd_streaming: document.getElementById('s-ds4-ssd-streaming').checked,
      build_target: document.getElementById('s-ds4-build-target').value,
      install_dir: document.getElementById('s-ds4-install-dir').value.trim(),
      auto_build: document.getElementById('s-ds4-auto-build').checked,

--- a/codai/api/ds4_worker.py
+++ b/codai/api/ds4_worker.py
@@ -35,6 +35,7 @@ import time
 import collections
 from pathlib import Path
 from typing import Optional
+from typing import Optional

 _lock = threading.RLock()
 # Single managed server (ds4 serves one DeepSeek V4 model). Keyed by model_id so a
@@ -195,22 +196,56 @@ def _health_ok(url: str) -> bool:
        return False


-def ensure_service(cfg, ready_timeout: float = 3600.0) -> str:
-    """Build + download (as needed), then start (or reuse) ds4-server.
+def resolve_service_key(cfg, model_file: Optional[str] = None):
+    """Decide which GGUF ds4-server should serve and the key to cache it under.

-    Returns the base URL. First call clones, builds, and downloads several GB, so the
-    timeout is generous. Raises RuntimeError if the service never becomes ready.
+    Preference: the requested model's own ``.gguf`` path → an explicit
+    ``cfg.model_path`` override → '' (download the variant as a last resort).
+    Returns ``(resolved_gguf_or_'', svc_key)``; the key is the file when we have
+    one (so different deepseek4 models get their own server), else ``model_id``.
+    """
+    resolved = ""
+    for cand in (model_file, getattr(cfg, "model_path", "") or ""):
+        cand = os.path.expanduser((cand or "").strip())
+        if cand and cand.lower().endswith(".gguf") and os.path.isfile(cand):
+            resolved = cand
+            break
+    svc_key = resolved or (getattr(cfg, "model_id", "deepseek-v4") or "deepseek-v4")
+    return resolved, svc_key
+
+
+def ensure_service(cfg, model_file: Optional[str] = None,
+                   ctx: Optional[int] = None,
+                   ready_timeout: float = 3600.0) -> str:
+    """Build (as needed), then start (or reuse) ds4-server serving the right GGUF.
+
+    ``model_file`` is the requested model's path; when it resolves to a local
+    ``.gguf`` (or ``cfg.model_path`` is set) ds4-server loads THAT via ``-m`` and
+    no weights are downloaded. Only when neither resolves does it fall back to
+    downloading ``cfg.model_variant``. ``ctx`` overrides the ds4 global context
+    (so the per-model n_ctx configuration wins). Returns the base URL.
    """
-    model_id = getattr(cfg, "model_id", "deepseek-v4") or "deepseek-v4"
+    resolved, svc_key = resolve_service_key(cfg, model_file)
    with _lock:
-        svc = _services.get(model_id)
+        svc = _services.get(svc_key)
        if svc and svc["proc"].poll() is None and _health_ok(svc["url"]):
            return svc["url"]
        if svc and svc["proc"].poll() is not None:
-            _services.pop(model_id, None)   # died — restart below
+            _services.pop(svc_key, None)   # died — restart below

        binary = ensure_built(cfg)
-        ensure_model(cfg)
+        if not resolved:
+            # No local deepseek4 GGUF resolved. Downloading ds4's own variant is
+            # OPT-IN (auto_download, off by default) — otherwise fail with a clear
+            # message instead of silently pulling tens of GB.
+            if bool(getattr(cfg, "auto_download", False)):
+                ensure_model(cfg)
+            else:
+                raise RuntimeError(
+                    "ds4: no local deepseek4 GGUF resolved for this request and "
+                    "auto-download is disabled. Point the model at a deepseek4 "
+                    ".gguf (or set ds4.model_path), or enable ds4.auto_download to "
+                    "fetch the configured weight variant.")

        install_dir = _install_dir(cfg)
        host = getattr(cfg, "host", "127.0.0.1") or "127.0.0.1"
@@ -220,21 +255,36 @@ def ensure_service(cfg, ready_timeout: float = 3600.0) -> str:
        connect_host = "127.0.0.1" if host in ("0.0.0.0", "::") else host
        url = f"http://{connect_host}:{port}"

+        # Per-model n_ctx (passed in) wins over the ds4 global ctx setting.
+        try:
+            ctx_val = int(ctx) if ctx else 0
+        except (TypeError, ValueError):
+            ctx_val = 0
+        if ctx_val <= 0:
+            ctx_val = int(getattr(cfg, "ctx", 100000) or 100000)
+
        cmd = [str(binary), "--host", host, "--port", str(port),
-               "--ctx", str(int(getattr(cfg, "ctx", 100000) or 100000)),
+               "--ctx", str(ctx_val),
               "--chdir", str(install_dir)]
+        if resolved:
+            cmd += ["-m", resolved]
+        if bool(getattr(cfg, "ssd_streaming", False)):
+            # Stream MoE experts from SSD/disk instead of full residency — lets a
+            # 100GB+ model run on a small GPU + modest RAM (slow but works).
+            cmd += ["--ssd-streaming"]
        extra = (getattr(cfg, "extra_args", "") or "").strip()
        if extra:
            import shlex
            cmd += shlex.split(extra)

+        print(f"[ds4] launching ds4-server: {' '.join(cmd)}", flush=True)
        proc = subprocess.Popen(
            cmd, cwd=str(install_dir), stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT, text=True, bufsize=1,
        )
        tail = collections.deque(maxlen=15)
        threading.Thread(target=_pump_logs, args=(proc, tail), daemon=True).start()
-        _services[model_id] = {"proc": proc, "port": port, "url": url}
+        _services[svc_key] = {"proc": proc, "port": port, "url": url}

    def _tail_msg():
        joined = " | ".join(list(tail)[-5:]).strip()
@@ -247,11 +297,11 @@ def ensure_service(cfg, ready_timeout: float = 3600.0) -> str:
                f"ds4-server exited (code {proc.returncode}) before becoming ready"
                + _tail_msg())
        if _health_ok(url):
-            print(f"[ds4] service ready for {model_id} at {url}", flush=True)
+            print(f"[ds4] service ready for {svc_key} at {url}", flush=True)
            return url
        time.sleep(2)
-    stop_service(model_id)
-    raise RuntimeError(f"ds4-server for {model_id} did not become ready in time"
+    stop_service(svc_key)
+    raise RuntimeError(f"ds4-server for {svc_key} did not become ready in time"
                       + _tail_msg())
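
As a reading aid for the hunks above, the context precedence and the command assembly can be sketched in isolation. This is a simplified sketch under stated assumptions: `effective_ctx` and `build_cmd` are hypothetical names, and the binary, install directory and model paths are placeholders. Only the flags (`--host`, `--port`, `--ctx`, `--chdir`, `-m`, `--ssd-streaming`) are taken from the diff.

```python
import shlex
from typing import List


def effective_ctx(per_model_ctx, global_ctx: int = 100000) -> int:
    """Per-model n_ctx wins when it is a positive integer, else the global ctx."""
    try:
        value = int(per_model_ctx) if per_model_ctx else 0
    except (TypeError, ValueError):
        value = 0
    return value if value > 0 else int(global_ctx or 100000)


def build_cmd(binary: str, host: str, port: int, ctx: int, install_dir: str,
              gguf: str = "", ssd_streaming: bool = False,
              extra_args: str = "") -> List[str]:
    """Assemble the ds4-server argument list in the same order as the diff."""
    cmd = [binary, "--host", host, "--port", str(port),
           "--ctx", str(ctx), "--chdir", install_dir]
    if gguf:
        cmd += ["-m", gguf]               # serve the resolved local GGUF
    if ssd_streaming:
        cmd += ["--ssd-streaming"]        # stream MoE experts from disk
    if extra_args.strip():
        cmd += shlex.split(extra_args)    # user-supplied extra flags
    return cmd


# Placeholder paths, for illustration only.
cmd = build_cmd("/opt/ds4/ds4-server", "127.0.0.1", 8089,
                effective_ctx(32768), "/opt/ds4",
                gguf="/models/example.gguf", ssd_streaming=True)
print(" ".join(cmd))
```

Keeping the fallback in one small function makes the rule easy to check: a missing, non-numeric or non-positive per-model value falls back to the global setting.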



--- a/codai/backends/ds4.py
+++ b/codai/backends/ds4.py
@@ -35,6 +35,7 @@ class Ds4Backend(ModelBackend):
            cfg = Ds4Config()
        self._cfg = cfg
        self._model_id = getattr(cfg, "model_id", "deepseek-v4") or "deepseek-v4"
+        self._svc_key: Optional[str] = None   # ds4_worker service key (file or model_id)
        self._url: Optional[str] = None
        self._ctx = int(getattr(cfg, "ctx", 100000) or 100000)
        self._last_usage: Dict = {}
@@ -46,7 +47,42 @@ class Ds4Backend(ModelBackend):
        from codai.api import ds4_worker
        if model_name:
            self._model_id = model_name
-        self._url = ds4_worker.ensure_service(self._cfg)
+        # Honour the model's configured context window (n_ctx / ctx from its
+        # models.json entry, forwarded by the manager) over the ds4 global ctx.
+        _ctx = kwargs.get("n_ctx", kwargs.get("ctx"))
+        if isinstance(_ctx, (list, tuple)):
+            _ctx = _ctx[0] if _ctx else None
+        try:
+            _ctx = int(_ctx) if _ctx else 0
+        except (TypeError, ValueError):
+            _ctx = 0
+        if _ctx > 0:
+            self._ctx = _ctx
+        # Resolve the requested model to a concrete .gguf so ds4-server serves THE
+        # deepseek4 model that was asked for (not a fixed downloaded variant).
+        model_file = self._resolve_gguf(model_name)
+        _resolved, self._svc_key = ds4_worker.resolve_service_key(self._cfg, model_file)
+        self._url = ds4_worker.ensure_service(
+            self._cfg, model_file=model_file, ctx=(self._ctx or None))
+
+    @staticmethod
+    def _resolve_gguf(model_name: str):
+        """Map a requested model name/path to a local .gguf path, if one exists."""
+        import os
+        if not model_name:
+            return None
+        cand = os.path.expanduser(model_name)
+        if cand.lower().endswith(".gguf") and os.path.isfile(cand):
+            return cand
+        # Bare filename / alias → look it up in the GGUF cache.
+        try:
+            from codai.models.cache import get_cached_model_path
+            p = get_cached_model_path(model_name)
+            if p and str(p).lower().endswith(".gguf") and os.path.isfile(p):
+                return str(p)
+        except Exception:
+            pass
+        return None

    def get_model_name(self) -> str:
        return self._model_id
@@ -59,7 +95,8 @@ class Ds4Backend(ModelBackend):

    def cleanup(self) -> None:
        from codai.api import ds4_worker
-        ds4_worker.stop_service(getattr(self._cfg, "model_id", self._model_id))
+        key = getattr(self, "_svc_key", None) or getattr(self._cfg, "model_id", self._model_id)
+        ds4_worker.stop_service(key)
        self._url = None

    # ------------------------------------------------------------------ #

--- a/codai/config.py
+++ b/codai/config.py
@@ -229,11 +229,18 @@ class Ds4Config:
    repo_url: str = "https://github.com/antirez/ds4"
    install_dir: Optional[str] = None      # None = ~/.coderai/ds4
    build_target: str = "auto"             # auto|cuda-generic|cuda-spark|metal|cpu
-    model_variant: str = "q4-imatrix"      # download_model.sh variant
+    # The model ds4-server loads. Preferred: serve a deepseek4 GGUF the user
+    # already has — the requested model's own path is used when it resolves to a
+    # local .gguf, else `model_path` (an explicit override), else the variant is
+    # downloaded as a last resort. So you normally DON'T set model_variant at all.
+    model_path: str = ""                   # explicit GGUF for ds4-server -m (overrides the download)
+    auto_download: bool = False            # OFF by default: only download a variant when explicitly opted in
+    model_variant: str = "q4-imatrix"      # download_model.sh variant (used only when auto_download is on)
    model_id: str = "deepseek-v4"          # model id/alias that routes to ds4
    host: str = "127.0.0.1"
    port: int = 0                          # 0 = auto-pick a free port
    ctx: int = 100000                      # ds4-server --ctx context window
+    ssd_streaming: bool = False            # ds4-server --ssd-streaming: stream experts from SSD/disk
    extra_args: str = ""                   # extra flags passed to ds4-server
    auto_build: bool = True                # clone+build the binary if it's missing

@@ -579,11 +586,14 @@ class ConfigManager:
                "repo_url": self.config.ds4.repo_url,
                "install_dir": self.config.ds4.install_dir,
                "build_target": self.config.ds4.build_target,
+                "model_path": self.config.ds4.model_path,
+                "auto_download": self.config.ds4.auto_download,
                "model_variant": self.config.ds4.model_variant,
                "model_id": self.config.ds4.model_id,
                "host": self.config.ds4.host,
                "port": self.config.ds4.port,
                "ctx": self.config.ds4.ctx,
+                "ssd_streaming": self.config.ds4.ssd_streaming,
                "extra_args": self.config.ds4.extra_args,
                "auto_build": self.config.ds4.auto_build,
            },

--- a/codai/models/manager.py
+++ b/codai/models/manager.py
@@ -47,11 +47,85 @@ def get_active_ds4_config():
    return None


+_GGUF_ARCH_CACHE: Dict[tuple, str] = {}
+
+
+def _resolve_local_gguf(model_name: str):
+    """Map a model name/alias/path to a local .gguf file path, or None."""
+    if not model_name:
+        return None
+    cand = os.path.expanduser(model_name)
+    if cand.lower().endswith(".gguf") and os.path.isfile(cand):
+        return cand
+    try:
+        p = get_cached_model_path(model_name)
+        if p and str(p).lower().endswith(".gguf") and os.path.isfile(str(p)):
+            return str(p)
+    except Exception:
+        pass
+    return None
+
+
+def _gguf_architecture(path: str):
+    """Read ``general.architecture`` from a GGUF header. Cached by (path,mtime,size)."""
+    import struct
+    try:
+        st = os.stat(path)
+        key = (path, st.st_mtime_ns, st.st_size)
+    except OSError:
+        return None
+    if key in _GGUF_ARCH_CACHE:
+        return _GGUF_ARCH_CACHE[key] or None
+    arch = ""
+    _sz = {0: 1, 1: 1, 7: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 10: 8, 11: 8, 12: 8}
+    try:
+        with open(path, "rb") as f:
+            if f.read(4) != b"GGUF":
+                _GGUF_ARCH_CACHE[key] = ""
+                return None
+            f.read(4); f.read(8)  # version, tensor_count
+            kv_count = struct.unpack("<Q", f.read(8))[0]
+
+            def _rs(fh):
+                n = struct.unpack("<Q", fh.read(8))[0]
+                return fh.read(n)
+
+            for _ in range(kv_count):
+                k = _rs(f)
+                vtype = struct.unpack("<I", f.read(4))[0]
+                if vtype == 8:  # string
+                    v = _rs(f)
+                    if k == b"general.architecture":
+                        arch = v.decode("utf-8", "ignore")
+                        break
+                elif vtype in _sz:
+                    f.read(_sz[vtype])
+                elif vtype == 9:  # array — skip its elements
+                    atype = struct.unpack("<I", f.read(4))[0]
+                    alen = struct.unpack("<Q", f.read(8))[0]
+                    if atype == 8:
+                        for _ in range(alen):
+                            _rs(f)
+                    elif atype in _sz:
+                        f.read(_sz[atype] * alen)
+                    else:
+                        break  # unknown element type — stop
+                else:
+                    break  # unknown value type — stop
+    except Exception:
+        arch = ""
+    _GGUF_ARCH_CACHE[key] = arch
+    return arch or None
+
+
 def ds4_should_handle(model_name: str) -> bool:
-    """True when ds4 is enabled and ``model_name`` should be served by ds4-server.
+    """True when ds4 is enabled and ``model_name`` is a DeepSeek-V4 (ds4) model.

-    Matches the configured ``model_id`` (case-insensitive, short-name aware) or any
-    name containing ``deepseek-v4``, so the stock alias works without extra config.
+    Routing is by the GGUF ARCHITECTURE, not the filename: ds4 serves only genuine
+    ``deepseek4`` GGUFs (its own format). Mainline DeepSeek GGUFs (deepseek/
+    deepseek2/deepseek3/deepseek32) are left to llama.cpp. The configured
+    ``model_id`` alias still routes (covers the variant ds4 downloads itself, which
+    has no local file yet).
    """
    if not model_name:
        return False
@@ -63,7 +137,13 @@ def ds4_should_handle(model_name: str) -> bool:
    mid = (getattr(cfg, "model_id", "") or "").lower()
    if mid and (name == mid or short == mid):
        return True
-    return "deepseek-v4" in name
+    # Definitive: read the GGUF architecture for a local file — only deepseek4.
+    path = _resolve_local_gguf(model_name)
+    if path:
+        return (_gguf_architecture(path) or "").lower() == "deepseek4"
+    # No local file (HF id / not downloaded yet): conservative name check that
+    # matches ONLY the V4 marker, so mainline deepseek GGUFs aren't grabbed.
+    return "deepseek-v4" in name or "deepseek4" in name


 def _trim_cpu_ram() -> None:
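
The architecture-based routing in `_gguf_architecture` can be exercised without a real model by writing a tiny GGUF header and reading it back. The sketch below is an illustration under assumptions: `write_tiny_gguf` and `read_arch` are hypothetical helpers, the reader skips only scalar and string values (the version in the diff also skips arrays), and the written file has no tensors, so it is not a loadable model.

```python
import os
import struct
import tempfile

GGUF_STRING = 8
# Byte sizes of the fixed-width GGUF value types (same table as in the diff).
SCALAR_SIZES = {0: 1, 1: 1, 7: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 10: 8, 11: 8, 12: 8}


def _pack_str(data: bytes) -> bytes:
    """GGUF string: uint64 little-endian length followed by the bytes."""
    return struct.pack("<Q", len(data)) + data


def write_tiny_gguf(path: str, arch: str) -> None:
    """Write a header-only GGUF with two metadata entries (test fixture)."""
    with open(path, "wb") as f:
        f.write(b"GGUF")
        f.write(struct.pack("<I", 3))            # version
        f.write(struct.pack("<Q", 0))            # tensor_count
        f.write(struct.pack("<Q", 2))            # kv_count
        f.write(_pack_str(b"general.alignment"))
        f.write(struct.pack("<I", 4))            # value type: uint32
        f.write(struct.pack("<I", 32))
        f.write(_pack_str(b"general.architecture"))
        f.write(struct.pack("<I", GGUF_STRING))
        f.write(_pack_str(arch.encode("utf-8")))


def read_arch(path: str):
    """Return general.architecture, or None if it cannot be determined."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            return None
        f.read(4)                                # version
        f.read(8)                                # tensor_count
        (kv_count,) = struct.unpack("<Q", f.read(8))
        for _ in range(kv_count):
            (klen,) = struct.unpack("<Q", f.read(8))
            key = f.read(klen)
            (vtype,) = struct.unpack("<I", f.read(4))
            if vtype == GGUF_STRING:
                (vlen,) = struct.unpack("<Q", f.read(8))
                value = f.read(vlen)
                if key == b"general.architecture":
                    return value.decode("utf-8", "ignore")
            elif vtype in SCALAR_SIZES:
                f.read(SCALAR_SIZES[vtype])      # skip a fixed-width value
            else:
                return None                      # arrays/unknown: not handled here
    return None


with tempfile.TemporaryDirectory() as tmp:
    v4 = os.path.join(tmp, "v4.gguf")
    v2 = os.path.join(tmp, "v2.gguf")
    junk = os.path.join(tmp, "junk.gguf")
    write_tiny_gguf(v4, "deepseek4")
    write_tiny_gguf(v2, "deepseek2")
    with open(junk, "wb") as f:
        f.write(b"not a gguf file")
    arch_v4, arch_v2, arch_junk = read_arch(v4), read_arch(v2), read_arch(junk)
    routes_to_ds4 = [(a or "").lower() == "deepseek4" for a in (arch_v4, arch_v2, arch_junk)]
```

Only the first file would route to ds4 under the rule in `ds4_should_handle`; the mainline architecture and the non-GGUF file would not.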