feat(ds4): auto-route deepseek4 GGUFs by architecture; serve the requested file

- Route to ds4 by GGUF ARCHITECTURE (general.architecture == "deepseek4"), read
  from the file header (cached) — not by filename. Mainline deepseek/2/3/32 GGUFs
  stay on llama.cpp; the model_id alias still routes for the download case.
- ds4-server now serves the REQUESTED GGUF: Ds4Backend resolves the model to a
  local .gguf and launches `ds4-server -m <file>` (resolve_service_key keys the
  managed service per file). No fixed-variant assumption.
- Honour the model's per-entry n_ctx for ds4-server --ctx (over the global ctx).
- New config.ds4 options + settings UI: ssd_streaming (--ssd-streaming, stream
  MoE experts from SSD/disk), model_path (explicit -m override), and
  auto_download (OFF by default — only serve GGUFs already present; error clearly
  instead of silently pulling tens of GB; opt in to fetch model_variant).
- AI.PROMPT: document DeepSeek-V4 = pending upstream llama.cpp PRs (needs new ggml
  ops) → ds4 for now; and ds4 routing/offload/text-only specifics.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 3834ecf5
@@ -361,3 +361,44 @@ Incremental update (FAST, ~30 s — code-only changes, NO bundle recopy):
 - CAUTION: COPY adds/overwrites but does NOT delete files removed from the
   repo; the cleanup RUN prunes only known-stale paths (.git/venv*/dist/...). A
   source file deleted from codai/ lingers in the overlay until a full rebuild.
+================================================================================
+## DeepSeek-V4: llama.cpp support is pending — ds4 for now
+================================================================================
+DeepSeek-V4 GGUFs use the `deepseek4` architecture (keys like
+`deepseek4.hash_layer_count`, `deepseek4.nextn_predict_layers`). The bundled
+llama.cpp (0.3.30, ggml 0.15.1) supports up to `deepseek32` (V3.2) — NOT
+`deepseek4`, so it fails fast with "Failed to load model from file" (this is an
+ARCHITECTURE gap, not VRAM/offload, and no quantization changes it).
+Mainline llama.cpp V4 support is IN PROGRESS but NOT yet merged — it needs new
+ggml ops (a "hyperconnection" op and `GGML_OP_LIGHTNING_INDEXER`) tracked in OPEN
+PRs ggml-org/llama.cpp#23122, #24162, #24231 (#22319 = the model request; #23706
+"can't launch deepseek-v4-flash" = the exact Flash variant). So:
+- Rebuilding llama.cpp from `master` today will NOT run V4 (ops unmerged).
+- TODO (revisit soon): once those PRs land, rebuild llama-cpp-python from a
+  tag/commit that includes them (CUDA build), and run V4 GGUFs converted with
+  MAINLINE's convert_hf_to_gguf (so the arch/tensors match) — any quant.
+- The user's `…-ds4-…Q2_K.gguf` is ds4 (antirez/ds4 / DwarfStar) format; it
+  will only ever load under ds4, not llama.cpp.
+ds4 (codai/backends/ds4.py, config.ds4, codai/api/ds4_worker.py) is the native
+DeepSeek-V4 engine: coderai owns ds4-server (clone+build) and proxies to it.
+Routing & model selection (manager.ds4_should_handle):
+- When ds4.enabled, a model routes to ds4 IFF its GGUF ARCHITECTURE is
+  `deepseek4` (read from the file header via manager._gguf_architecture, cached)
+  — NOT by filename. Mainline deepseek/2/3/32 GGUFs stay on llama.cpp. The
+  config.ds4.model_id alias also routes (for the downloaded-variant case).
+- ds4-server serves the REQUESTED GGUF: Ds4Backend resolves the model to a local
+  .gguf and launches `ds4-server -m <that file>`. So any deepseek4 GGUF you have
+  is served directly — no per-model setup.
+- `--ctx` honours the model's per-entry n_ctx (forwarded via kwargs), not just
+  config.ds4.ctx.
+- config.ds4.ssd_streaming → `--ssd-streaming` (stream MoE experts from SSD/disk;
+  run a 100GB+ model on a small GPU). config.ds4.model_path = explicit -m override.
+- config.ds4.auto_download is OFF by default: ds4 only serves GGUFs you already
+  have; with no local file it errors (clear message) instead of pulling tens of
+  GB. Enable auto_download to fetch config.ds4.model_variant as a fallback.
+- ds4-server exposes only TEXT APIs (chat/completions/responses/anthropic) — no
+  image generation; that needs a separate diffusion model.
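The launch rules in the notes above (per-model n_ctx beats config.ds4.ctx, `-m` for the resolved file, `--ssd-streaming` when enabled) can be sketched as one small pure function. This is an illustrative sketch, not the project's code: `build_ds4_argv` and its parameter names are hypothetical, and only the flags `--host`, `--port`, `--ctx`, `-m` and `--ssd-streaming` are taken from this change.

```python
from typing import List, Optional


def build_ds4_argv(binary: str, host: str, port: int, global_ctx: int,
                   n_ctx: Optional[int] = None, model_file: str = "",
                   ssd_streaming: bool = False) -> List[str]:
    """Hypothetical helper: assemble a ds4-server command line."""
    # Per-model n_ctx wins; fall back to the global ctx when unset or invalid.
    try:
        ctx_val = int(n_ctx) if n_ctx else 0
    except (TypeError, ValueError):
        ctx_val = 0
    if ctx_val <= 0:
        ctx_val = int(global_ctx)

    argv = [binary, "--host", host, "--port", str(port), "--ctx", str(ctx_val)]
    if model_file:
        # Serve exactly the GGUF that was resolved for this request.
        argv += ["-m", model_file]
    if ssd_streaming:
        argv.append("--ssd-streaming")
    return argv
```

Keeping the assembly free of side effects makes the precedence rules easy to check without starting a server; the path and port in any call are placeholders.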
@@ -3163,11 +3163,14 @@ async def api_get_settings(username: str = Depends(require_admin)):
             "repo_url": c.ds4.repo_url,
             "install_dir": c.ds4.install_dir,
             "build_target": c.ds4.build_target,
+            "model_path": c.ds4.model_path,
+            "auto_download": c.ds4.auto_download,
             "model_variant": c.ds4.model_variant,
             "model_id": c.ds4.model_id,
             "host": c.ds4.host,
             "port": c.ds4.port,
             "ctx": c.ds4.ctx,
+            "ssd_streaming": c.ds4.ssd_streaming,
             "extra_args": c.ds4.extra_args,
             "auto_build": c.ds4.auto_build,
         },
@@ -3431,6 +3434,12 @@ async def api_save_settings(request: Request, username: str = Depends(require_ad
             c.ds4.install_dir = (d.get("install_dir") or "").strip() or None
         if "build_target" in d:
             c.ds4.build_target = (d.get("build_target") or "auto").strip()
+        if "model_path" in d:
+            c.ds4.model_path = (d.get("model_path") or "").strip()
+        if "auto_download" in d:
+            c.ds4.auto_download = bool(d["auto_download"])
+        if "ssd_streaming" in d:
+            c.ds4.ssd_streaming = bool(d["ssd_streaming"])
         if "model_variant" in d:
             c.ds4.model_variant = (d.get("model_variant") or c.ds4.model_variant).strip()
         if "model_id" in d:
......
@@ -394,18 +394,36 @@
       <div class="form-row" style="margin:0">
         <label class="form-label">Model id / alias</label>
         <input type="text" id="s-ds4-model-id" class="form-input" placeholder="deepseek-v4">
-        <span class="form-hint">Requests for this id (or any name containing "deepseek-v4") route to ds4.</span>
+        <span class="form-hint">When enabled, ds4 automatically serves any <b>deepseek4</b>-architecture GGUF you have (detected by reading the file); mainline DeepSeek GGUFs stay on llama.cpp. This alias also routes (for ds4's own downloaded variant).</span>
       </div>
       <div class="form-row" style="margin:0">
-        <label class="form-label">Weight variant</label>
+        <label class="form-label">Weight variant <span class="muted">(download fallback)</span></label>
         <select id="s-ds4-variant" class="form-input">
           <option value="q2-imatrix">q2-imatrix (96/128 GB)</option>
           <option value="q2-q4-imatrix">q2-q4-imatrix (96/128 GB)</option>
           <option value="q4-imatrix">q4-imatrix (256 GB+)</option>
           <option value="pro-q2-imatrix">pro-q2-imatrix (512 GB)</option>
         </select>
+        <span class="form-hint">Only used when auto-download is enabled below.</span>
       </div>
     </div>
+    <div class="form-row">
+      <label style="display:flex;align-items:center;gap:.5rem;cursor:pointer">
+        <input type="checkbox" id="s-ds4-auto-download">
+        <span>Auto-download the weight variant <span class="muted">— OFF by default; when off, ds4 only serves deepseek4 GGUFs you already have and errors if none resolves (never pulls tens of GB silently)</span></span>
+      </label>
+    </div>
+    <div class="form-row">
+      <label class="form-label">Model GGUF path <span class="muted">(optional override)</span></label>
+      <input type="text" id="s-ds4-model-path" class="form-input" placeholder="/AI/guffcache/…-ds4-Q2_K.gguf">
+      <span class="form-hint">Force ds4-server to load this exact GGUF (skips the variant download). Leave blank to auto-serve the requested deepseek4 model.</span>
+    </div>
+    <div class="form-row">
+      <label style="display:flex;align-items:center;gap:.5rem;cursor:pointer">
+        <input type="checkbox" id="s-ds4-ssd-streaming">
+        <span>SSD streaming <span class="muted">— stream MoE experts from disk/SSD (run a 100GB+ model on a small GPU + modest RAM; slower)</span></span>
+      </label>
+    </div>
     <div style="display:grid;grid-template-columns:1fr 1fr;gap:1rem;align-items:start">
       <div class="form-row" style="margin:0">
         <label class="form-label">Build target</label>
@@ -656,6 +674,9 @@ async function loadSettings(){
     document.getElementById('s-ds4-enabled').checked = !!ds4.enabled;
     document.getElementById('s-ds4-model-id').value = ds4.model_id ?? 'deepseek-v4';
     document.getElementById('s-ds4-variant').value = ds4.model_variant ?? 'q4-imatrix';
+    document.getElementById('s-ds4-model-path').value = ds4.model_path ?? '';
+    document.getElementById('s-ds4-auto-download').checked = !!ds4.auto_download;
+    document.getElementById('s-ds4-ssd-streaming').checked = !!ds4.ssd_streaming;
     document.getElementById('s-ds4-build-target').value = ds4.build_target ?? 'auto';
     document.getElementById('s-ds4-install-dir').value = ds4.install_dir ?? '';
     document.getElementById('s-ds4-auto-build').checked = ds4.auto_build !== false;
@@ -728,6 +749,9 @@ async function saveSettings(){
       enabled: document.getElementById('s-ds4-enabled').checked,
       model_id: document.getElementById('s-ds4-model-id').value.trim() || 'deepseek-v4',
       model_variant: document.getElementById('s-ds4-variant').value,
+      model_path: document.getElementById('s-ds4-model-path').value.trim(),
+      auto_download: document.getElementById('s-ds4-auto-download').checked,
+      ssd_streaming: document.getElementById('s-ds4-ssd-streaming').checked,
       build_target: document.getElementById('s-ds4-build-target').value,
       install_dir: document.getElementById('s-ds4-install-dir').value.trim(),
       auto_build: document.getElementById('s-ds4-auto-build').checked,
......
@@ -35,6 +35,7 @@ import time
 import collections
 from pathlib import Path
 from typing import Optional
+from typing import Optional
 _lock = threading.RLock()
 # Single managed server (ds4 serves one DeepSeek V4 model). Keyed by model_id so a
@@ -195,22 +196,56 @@ def _health_ok(url: str) -> bool:
         return False


-def ensure_service(cfg, ready_timeout: float = 3600.0) -> str:
-    """Build + download (as needed), then start (or reuse) ds4-server.
-    Returns the base URL. First call clones, builds, and downloads several GB, so the
-    timeout is generous. Raises RuntimeError if the service never becomes ready.
+def resolve_service_key(cfg, model_file: Optional[str] = None):
+    """Decide which GGUF ds4-server should serve and the key to cache it under.
+    Preference: the requested model's own ``.gguf`` path → an explicit
+    ``cfg.model_path`` override → '' (download the variant as a last resort).
+    Returns ``(resolved_gguf_or_'', svc_key)``; the key is the file when we have
+    one (so different deepseek4 models get their own server), else ``model_id``.
+    """
+    resolved = ""
+    for cand in (model_file, getattr(cfg, "model_path", "") or ""):
+        cand = os.path.expanduser((cand or "").strip())
+        if cand and cand.lower().endswith(".gguf") and os.path.isfile(cand):
+            resolved = cand
+            break
+    svc_key = resolved or (getattr(cfg, "model_id", "deepseek-v4") or "deepseek-v4")
+    return resolved, svc_key
+
+
+def ensure_service(cfg, model_file: Optional[str] = None,
+                   ctx: Optional[int] = None,
+                   ready_timeout: float = 3600.0) -> str:
+    """Build (as needed), then start (or reuse) ds4-server serving the right GGUF.
+    ``model_file`` is the requested model's path; when it resolves to a local
+    ``.gguf`` (or ``cfg.model_path`` is set) ds4-server loads THAT via ``-m`` and
+    no weights are downloaded. Only when neither resolves does it fall back to
+    downloading ``cfg.model_variant``. ``ctx`` overrides the ds4 global context
+    (so the per-model n_ctx configuration wins). Returns the base URL.
     """
-    model_id = getattr(cfg, "model_id", "deepseek-v4") or "deepseek-v4"
+    resolved, svc_key = resolve_service_key(cfg, model_file)
     with _lock:
-        svc = _services.get(model_id)
+        svc = _services.get(svc_key)
         if svc and svc["proc"].poll() is None and _health_ok(svc["url"]):
             return svc["url"]
         if svc and svc["proc"].poll() is not None:
-            _services.pop(model_id, None)  # died — restart below
+            _services.pop(svc_key, None)  # died — restart below
         binary = ensure_built(cfg)
-        ensure_model(cfg)
+        if not resolved:
+            # No local deepseek4 GGUF resolved. Downloading ds4's own variant is
+            # OPT-IN (auto_download, off by default) — otherwise fail with a clear
+            # message instead of silently pulling tens of GB.
+            if bool(getattr(cfg, "auto_download", False)):
+                ensure_model(cfg)
+            else:
+                raise RuntimeError(
+                    "ds4: no local deepseek4 GGUF resolved for this request and "
+                    "auto-download is disabled. Point the model at a deepseek4 "
+                    ".gguf (or set ds4.model_path), or enable ds4.auto_download to "
+                    "fetch the configured weight variant.")
         install_dir = _install_dir(cfg)
         host = getattr(cfg, "host", "127.0.0.1") or "127.0.0.1"
@@ -220,21 +255,36 @@ def ensure_service(cfg, ready_timeout: float = 3600.0) -> str:
         connect_host = "127.0.0.1" if host in ("0.0.0.0", "::") else host
         url = f"http://{connect_host}:{port}"
+        # Per-model n_ctx (passed in) wins over the ds4 global ctx setting.
+        try:
+            ctx_val = int(ctx) if ctx else 0
+        except (TypeError, ValueError):
+            ctx_val = 0
+        if ctx_val <= 0:
+            ctx_val = int(getattr(cfg, "ctx", 100000) or 100000)
         cmd = [str(binary), "--host", host, "--port", str(port),
-               "--ctx", str(int(getattr(cfg, "ctx", 100000) or 100000)),
+               "--ctx", str(ctx_val),
                "--chdir", str(install_dir)]
+        if resolved:
+            cmd += ["-m", resolved]
+        if bool(getattr(cfg, "ssd_streaming", False)):
+            # Stream MoE experts from SSD/disk instead of full residency — lets a
+            # 100GB+ model run on a small GPU + modest RAM (slow but works).
+            cmd += ["--ssd-streaming"]
         extra = (getattr(cfg, "extra_args", "") or "").strip()
         if extra:
             import shlex
             cmd += shlex.split(extra)
+        print(f"[ds4] launching ds4-server: {' '.join(cmd)}", flush=True)
         proc = subprocess.Popen(
             cmd, cwd=str(install_dir), stdout=subprocess.PIPE,
             stderr=subprocess.STDOUT, text=True, bufsize=1,
         )
         tail = collections.deque(maxlen=15)
         threading.Thread(target=_pump_logs, args=(proc, tail), daemon=True).start()
-        _services[model_id] = {"proc": proc, "port": port, "url": url}
+        _services[svc_key] = {"proc": proc, "port": port, "url": url}
     def _tail_msg():
         joined = " | ".join(list(tail)[-5:]).strip()
@@ -247,11 +297,11 @@ def ensure_service(cfg, ready_timeout: float = 3600.0) -> str:
                 f"ds4-server exited (code {proc.returncode}) before becoming ready"
                 + _tail_msg())
         if _health_ok(url):
-            print(f"[ds4] service ready for {model_id} at {url}", flush=True)
+            print(f"[ds4] service ready for {svc_key} at {url}", flush=True)
             return url
         time.sleep(2)
-    stop_service(model_id)
-    raise RuntimeError(f"ds4-server for {model_id} did not become ready in time"
+    stop_service(svc_key)
+    raise RuntimeError(f"ds4-server for {svc_key} did not become ready in time"
                        + _tail_msg())
......
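The per-file keying above can be checked in isolation. The following is a minimal sketch under stated assumptions: `resolve_key` is a hypothetical stand-in that mirrors the preference order of `resolve_service_key` (requested file, then the `model_path` override, then the alias) but takes plain arguments instead of the config object, and the `.gguf` files are empty placeholders created in a temporary directory.

```python
import os
import tempfile
from typing import Optional, Tuple


def resolve_key(model_file: Optional[str], model_path: str,
                model_id: str) -> Tuple[str, str]:
    """Hypothetical stand-in: requested file, then override, then alias."""
    for cand in (model_file, model_path):
        cand = os.path.expanduser((cand or "").strip())
        if cand and cand.lower().endswith(".gguf") and os.path.isfile(cand):
            return cand, cand      # the file itself is the service key
    return "", model_id            # nothing local: key falls back to the alias


def demo() -> dict:
    """Create two placeholder .gguf files and resolve a few requests."""
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "a.gguf")
        b = os.path.join(tmp, "b.gguf")
        for p in (a, b):
            open(p, "wb").close()
        missing = os.path.join(tmp, "nope.gguf")
        return {
            "requested_wins": resolve_key(a, b, "deepseek-v4") == (a, a),
            "override_used": resolve_key(None, b, "deepseek-v4") == (b, b),
            "alias_fallback": resolve_key(missing, "", "deepseek-v4"),
            "distinct_keys": resolve_key(a, "", "x")[1] != resolve_key(b, "", "x")[1],
        }
```

Because the key is the resolved path, two different deepseek4 files map to two entries in the service table, while a request with no local file shares the single alias-keyed entry.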
@@ -35,6 +35,7 @@ class Ds4Backend(ModelBackend):
             cfg = Ds4Config()
         self._cfg = cfg
         self._model_id = getattr(cfg, "model_id", "deepseek-v4") or "deepseek-v4"
+        self._svc_key: Optional[str] = None  # ds4_worker service key (file or model_id)
         self._url: Optional[str] = None
         self._ctx = int(getattr(cfg, "ctx", 100000) or 100000)
         self._last_usage: Dict = {}
@@ -46,7 +47,42 @@ class Ds4Backend(ModelBackend):
         from codai.api import ds4_worker
         if model_name:
             self._model_id = model_name
-        self._url = ds4_worker.ensure_service(self._cfg)
+        # Honour the model's configured context window (n_ctx / ctx from its
+        # models.json entry, forwarded by the manager) over the ds4 global ctx.
+        _ctx = kwargs.get("n_ctx", kwargs.get("ctx"))
+        if isinstance(_ctx, (list, tuple)):
+            _ctx = _ctx[0] if _ctx else None
+        try:
+            _ctx = int(_ctx) if _ctx else 0
+        except (TypeError, ValueError):
+            _ctx = 0
+        if _ctx > 0:
+            self._ctx = _ctx
+        # Resolve the requested model to a concrete .gguf so ds4-server serves THE
+        # deepseek4 model that was asked for (not a fixed downloaded variant).
+        model_file = self._resolve_gguf(model_name)
+        _resolved, self._svc_key = ds4_worker.resolve_service_key(self._cfg, model_file)
+        self._url = ds4_worker.ensure_service(
+            self._cfg, model_file=model_file, ctx=(self._ctx or None))
+
+    @staticmethod
+    def _resolve_gguf(model_name: str):
+        """Map a requested model name/path to a local .gguf path, if one exists."""
+        import os
+        if not model_name:
+            return None
+        cand = os.path.expanduser(model_name)
+        if cand.lower().endswith(".gguf") and os.path.isfile(cand):
+            return cand
+        # Bare filename / alias → look it up in the GGUF cache.
+        try:
+            from codai.models.cache import get_cached_model_path
+            p = get_cached_model_path(model_name)
+            if p and str(p).lower().endswith(".gguf") and os.path.isfile(p):
+                return str(p)
+        except Exception:
+            pass
+        return None

     def get_model_name(self) -> str:
         return self._model_id
@@ -59,7 +95,8 @@ class Ds4Backend(ModelBackend):
     def cleanup(self) -> None:
         from codai.api import ds4_worker
-        ds4_worker.stop_service(getattr(self._cfg, "model_id", self._model_id))
+        key = getattr(self, "_svc_key", None) or getattr(self._cfg, "model_id", self._model_id)
+        ds4_worker.stop_service(key)
         self._url = None
     # ------------------------------------------------------------------ #
......
@@ -229,11 +229,18 @@ class Ds4Config:
     repo_url: str = "https://github.com/antirez/ds4"
     install_dir: Optional[str] = None  # None = ~/.coderai/ds4
     build_target: str = "auto"  # auto|cuda-generic|cuda-spark|metal|cpu
-    model_variant: str = "q4-imatrix"  # download_model.sh variant
+    # The model ds4-server loads. Preferred: serve a deepseek4 GGUF the user
+    # already has — the requested model's own path is used when it resolves to a
+    # local .gguf, else `model_path` (an explicit override), else the variant is
+    # downloaded as a last resort. So you normally DON'T set model_variant at all.
+    model_path: str = ""  # explicit GGUF for ds4-server -m (overrides the download)
+    auto_download: bool = False  # OFF by default: only download a variant when explicitly opted in
+    model_variant: str = "q4-imatrix"  # download_model.sh variant (used only when auto_download is on)
     model_id: str = "deepseek-v4"  # model id/alias that routes to ds4
     host: str = "127.0.0.1"
     port: int = 0  # 0 = auto-pick a free port
     ctx: int = 100000  # ds4-server --ctx context window
+    ssd_streaming: bool = False  # ds4-server --ssd-streaming: stream experts from SSD/disk
     extra_args: str = ""  # extra flags passed to ds4-server
     auto_build: bool = True  # clone+build the binary if it's missing
@@ -579,11 +586,14 @@ class ConfigManager:
             "repo_url": self.config.ds4.repo_url,
             "install_dir": self.config.ds4.install_dir,
             "build_target": self.config.ds4.build_target,
+            "model_path": self.config.ds4.model_path,
+            "auto_download": self.config.ds4.auto_download,
             "model_variant": self.config.ds4.model_variant,
             "model_id": self.config.ds4.model_id,
             "host": self.config.ds4.host,
             "port": self.config.ds4.port,
             "ctx": self.config.ds4.ctx,
+            "ssd_streaming": self.config.ds4.ssd_streaming,
             "extra_args": self.config.ds4.extra_args,
             "auto_build": self.config.ds4.auto_build,
         },
......
@@ -47,11 +47,85 @@ def get_active_ds4_config():
     return None


+_GGUF_ARCH_CACHE: Dict[tuple, str] = {}
+
+
+def _resolve_local_gguf(model_name: str):
+    """Map a model name/alias/path to a local .gguf file path, or None."""
+    if not model_name:
+        return None
+    cand = os.path.expanduser(model_name)
+    if cand.lower().endswith(".gguf") and os.path.isfile(cand):
+        return cand
+    try:
+        p = get_cached_model_path(model_name)
+        if p and str(p).lower().endswith(".gguf") and os.path.isfile(str(p)):
+            return str(p)
+    except Exception:
+        pass
+    return None
+
+
+def _gguf_architecture(path: str):
+    """Read ``general.architecture`` from a GGUF header. Cached by (path,mtime,size)."""
+    import struct
+    try:
+        st = os.stat(path)
+        key = (path, st.st_mtime_ns, st.st_size)
+    except OSError:
+        return None
+    if key in _GGUF_ARCH_CACHE:
+        return _GGUF_ARCH_CACHE[key] or None
+    arch = ""
+    _sz = {0: 1, 1: 1, 7: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 10: 8, 11: 8, 12: 8}
+    try:
+        with open(path, "rb") as f:
+            if f.read(4) != b"GGUF":
+                _GGUF_ARCH_CACHE[key] = ""
+                return None
+            f.read(4); f.read(8)  # version, tensor_count
+            kv_count = struct.unpack("<Q", f.read(8))[0]
+            def _rs(fh):
+                n = struct.unpack("<Q", fh.read(8))[0]
+                return fh.read(n)
+            for _ in range(kv_count):
+                k = _rs(f)
+                vtype = struct.unpack("<I", f.read(4))[0]
+                if vtype == 8:  # string
+                    v = _rs(f)
+                    if k == b"general.architecture":
+                        arch = v.decode("utf-8", "ignore")
+                        break
+                elif vtype in _sz:
+                    f.read(_sz[vtype])
+                elif vtype == 9:  # array — skip its elements
+                    atype = struct.unpack("<I", f.read(4))[0]
+                    alen = struct.unpack("<Q", f.read(8))[0]
+                    if atype == 8:
+                        for _ in range(alen):
+                            _rs(f)
+                    elif atype in _sz:
+                        f.read(_sz[atype] * alen)
+                    else:
+                        break  # unknown element type — stop
+                else:
+                    break  # unknown value type — stop
+    except Exception:
+        arch = ""
+    _GGUF_ARCH_CACHE[key] = arch
+    return arch or None
+
+
 def ds4_should_handle(model_name: str) -> bool:
-    """True when ds4 is enabled and ``model_name`` should be served by ds4-server.
-    Matches the configured ``model_id`` (case-insensitive, short-name aware) or any
-    name containing ``deepseek-v4``, so the stock alias works without extra config.
+    """True when ds4 is enabled and ``model_name`` is a DeepSeek-V4 (ds4) model.
+    Routing is by the GGUF ARCHITECTURE, not the filename: ds4 serves only genuine
+    ``deepseek4`` GGUFs (its own format). Mainline DeepSeek GGUFs (deepseek/
+    deepseek2/deepseek3/deepseek32) are left to llama.cpp. The configured
+    ``model_id`` alias still routes (covers the variant ds4 downloads itself, which
+    has no local file yet).
     """
     if not model_name:
         return False
@@ -63,7 +137,13 @@ def ds4_should_handle(model_name: str) -> bool:
     mid = (getattr(cfg, "model_id", "") or "").lower()
     if mid and (name == mid or short == mid):
         return True
-    return "deepseek-v4" in name
+    # Definitive: read the GGUF architecture for a local file — only deepseek4.
+    path = _resolve_local_gguf(model_name)
+    if path:
+        return (_gguf_architecture(path) or "").lower() == "deepseek4"
+    # No local file (HF id / not downloaded yet): conservative name check that
+    # matches ONLY the V4 marker, so mainline deepseek GGUFs aren't grabbed.
+    return "deepseek-v4" in name or "deepseek4" in name


 def _trim_cpu_ram() -> None:
......
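Architecture-based routing is easiest to test with a tiny header-only file rather than a multi-GB model. The sketch below writes a GGUF header whose only metadata entry is `general.architecture` and reads it back. It follows the layout the reader above parses (magic, uint32 version, uint64 tensor count, uint64 key/value count, length-prefixed strings, value type 8 for strings). `write_minimal_gguf`, `read_first_string_kv` and `roundtrip` are hypothetical helper names, and the reader here deliberately handles only the fixture's single string entry.

```python
import os
import struct
import tempfile


def _gguf_string(data: bytes) -> bytes:
    """A GGUF string is a little-endian uint64 length followed by the bytes."""
    return struct.pack("<Q", len(data)) + data


def write_minimal_gguf(path: str, arch: str) -> None:
    """Write a header-only GGUF with one string key: general.architecture."""
    with open(path, "wb") as f:
        f.write(b"GGUF")
        f.write(struct.pack("<I", 3))   # version
        f.write(struct.pack("<Q", 0))   # tensor_count
        f.write(struct.pack("<Q", 1))   # metadata kv_count
        f.write(_gguf_string(b"general.architecture"))
        f.write(struct.pack("<I", 8))   # value type 8 = string
        f.write(_gguf_string(arch.encode("utf-8")))


def read_first_string_kv(path: str):
    """Return (key, value) for the first entry if it is a string, else None."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            return None
        f.read(4 + 8)                   # skip version and tensor_count
        (kv_count,) = struct.unpack("<Q", f.read(8))
        if kv_count < 1:
            return None
        (klen,) = struct.unpack("<Q", f.read(8))
        key = f.read(klen)
        (vtype,) = struct.unpack("<I", f.read(4))
        if vtype != 8:
            return None
        (vlen,) = struct.unpack("<Q", f.read(8))
        return key.decode("utf-8"), f.read(vlen).decode("utf-8")


def roundtrip(arch: str):
    """Write a fixture to a temporary directory and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "tiny.gguf")
        write_minimal_gguf(path, arch)
        return read_first_string_kv(path)
```

A fixture written with `deepseek4` and another with `deepseek2` would let a test confirm that only the first routes to ds4, assuming the routing code is pointed at those files.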