feat(ds4): auto-route deepseek4 GGUFs by architecture; serve the requested file

- Route to ds4 by GGUF ARCHITECTURE (general.architecture == "deepseek4"), read
  from the file header (cached) — not by filename. Mainline deepseek/2/3/32 GGUFs
  stay on llama.cpp; the model_id alias still routes for the download case.
- ds4-server now serves the REQUESTED GGUF: Ds4Backend resolves the model to a
  local .gguf and launches `ds4-server -m <file>` (resolve_service_key keys the
  managed service per file). No fixed-variant assumption.
- Honour the model's per-entry n_ctx for ds4-server --ctx (over the global ctx).
- New config.ds4 options + settings UI: ssd_streaming (--ssd-streaming, stream
  MoE experts from SSD/disk), model_path (explicit -m override), and
  auto_download (OFF by default — only serve GGUFs already present; error clearly
  instead of silently pulling tens of GB; opt in to fetch model_variant).
- AI.PROMPT: document DeepSeek-V4 = pending upstream llama.cpp PRs (needs new ggml
  ops) → ds4 for now; and ds4 routing/offload/text-only specifics.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 3834ecf5
@@ -361,3 +361,44 @@ Incremental update (FAST, ~30 s — code-only changes, NO bundle recopy):
- CAUTION: COPY adds/overwrites but does NOT delete files removed from the
repo; the cleanup RUN prunes only known-stale paths (.git/venv*/dist/...). A
source file deleted from codai/ lingers in the overlay until a full rebuild.
================================================================================
## DeepSeek-V4: llama.cpp support is pending — ds4 for now
================================================================================
DeepSeek-V4 GGUFs use the `deepseek4` architecture (keys like
`deepseek4.hash_layer_count`, `deepseek4.nextn_predict_layers`). The bundled
llama.cpp (0.3.30, ggml 0.15.1) supports up to `deepseek32` (V3.2) — NOT
`deepseek4`, so it fails fast with "Failed to load model from file" (this is an
ARCHITECTURE gap, not VRAM/offload, and no quantization changes it).
Mainline llama.cpp V4 support is IN PROGRESS but NOT yet merged — it needs new
ggml ops (a "hyperconnection" op and `GGML_OP_LIGHTNING_INDEXER`) tracked in OPEN
PRs ggml-org/llama.cpp#23122, #24162, #24231 (#22319 = the model request; #23706
"can't launch deepseek-v4-flash" = the exact Flash variant). So:
- Rebuilding llama.cpp from `master` today will NOT run V4 (ops unmerged).
- TODO (revisit soon): once those PRs land, rebuild llama-cpp-python from a
tag/commit that includes them (CUDA build), and run V4 GGUFs converted with
MAINLINE's convert_hf_to_gguf (so the arch/tensors match) — any quant.
- The user's `…-ds4-…Q2_K.gguf` is ds4 (antirez/ds4 / DwarfStar) format; it
will only ever load under ds4, not llama.cpp.
ds4 (codai/backends/ds4.py, config.ds4, codai/api/ds4_worker.py) is the native
DeepSeek-V4 engine: coderai owns ds4-server (clone+build) and proxies to it.
Routing & model selection (manager.ds4_should_handle):
- When ds4.enabled, a model routes to ds4 IFF its GGUF ARCHITECTURE is
`deepseek4` (read from the file header via manager._gguf_architecture, cached)
— NOT by filename. Mainline deepseek/2/3/32 GGUFs stay on llama.cpp. The
config.ds4.model_id alias also routes (for the downloaded-variant case).
- ds4-server serves the REQUESTED GGUF: Ds4Backend resolves the model to a local
.gguf and launches `ds4-server -m <that file>`. So any deepseek4 GGUF you have
is served directly — no per-model setup.
- `--ctx` honours the model's per-entry n_ctx (forwarded via kwargs), not just
config.ds4.ctx.
- config.ds4.ssd_streaming → `--ssd-streaming` (stream MoE experts from SSD/disk;
run a 100GB+ model on a small GPU). config.ds4.model_path = explicit -m override.
- config.ds4.auto_download is OFF by default: ds4 only serves GGUFs you already
have; with no local file it errors (clear message) instead of pulling tens of
GB. Enable auto_download to fetch config.ds4.model_variant as a fallback.
- ds4-server exposes only TEXT APIs (chat/completions/responses/anthropic) — no
image generation; that needs a separate diffusion model.
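The routing rule described above can be sketched as a pure function. This is a minimal illustration, not the project's `ds4_should_handle`: the name `pick_backend` and its parameters are hypothetical, and `arch` stands for the `general.architecture` value of a local GGUF (or `None` when no file is on disk yet).

```python
from typing import Optional


def pick_backend(model_name: str, arch: Optional[str], *,
                 ds4_enabled: bool = True,
                 alias: str = "deepseek-v4") -> str:
    """Return "ds4" or "llama.cpp" for a request (illustrative only).

    `arch` is the GGUF architecture of the resolved local file, or None when
    the model has no local file (HF id / not downloaded yet).
    """
    name = (model_name or "").lower()
    if not ds4_enabled or not name:
        return "llama.cpp"
    # The configured alias always routes to ds4 (download case).
    if name == alias.lower():
        return "ds4"
    # A local file exists: the architecture decides, the filename does not.
    if arch is not None:
        return "ds4" if arch.lower() == "deepseek4" else "llama.cpp"
    # No local file: conservative name check for the V4 marker only.
    if "deepseek-v4" in name or "deepseek4" in name:
        return "ds4"
    return "llama.cpp"
```

A file named like a V4 model but carrying a `deepseek2` header stays on llama.cpp, while any file whose header says `deepseek4` goes to ds4 regardless of its name.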
@@ -3163,11 +3163,14 @@ async def api_get_settings(username: str = Depends(require_admin)):
"repo_url": c.ds4.repo_url,
"install_dir": c.ds4.install_dir,
"build_target": c.ds4.build_target,
"model_path": c.ds4.model_path,
"auto_download": c.ds4.auto_download,
"model_variant": c.ds4.model_variant,
"model_id": c.ds4.model_id,
"host": c.ds4.host,
"port": c.ds4.port,
"ctx": c.ds4.ctx,
"ssd_streaming": c.ds4.ssd_streaming,
"extra_args": c.ds4.extra_args,
"auto_build": c.ds4.auto_build,
},
@@ -3431,6 +3434,12 @@ async def api_save_settings(request: Request, username: str = Depends(require_ad
c.ds4.install_dir = (d.get("install_dir") or "").strip() or None
if "build_target" in d:
c.ds4.build_target = (d.get("build_target") or "auto").strip()
if "model_path" in d:
c.ds4.model_path = (d.get("model_path") or "").strip()
if "auto_download" in d:
c.ds4.auto_download = bool(d["auto_download"])
if "ssd_streaming" in d:
c.ds4.ssd_streaming = bool(d["ssd_streaming"])
if "model_variant" in d:
c.ds4.model_variant = (d.get("model_variant") or c.ds4.model_variant).strip()
if "model_id" in d:
......
@@ -394,18 +394,36 @@
<div class="form-row" style="margin:0">
<label class="form-label">Model id / alias</label>
<input type="text" id="s-ds4-model-id" class="form-input" placeholder="deepseek-v4">
<span class="form-hint">Requests for this id (or any name containing "deepseek-v4") route to ds4.</span>
<span class="form-hint">When enabled, ds4 automatically serves any <b>deepseek4</b>-architecture GGUF you have (detected by reading the file); mainline DeepSeek GGUFs stay on llama.cpp. This alias also routes (for ds4's own downloaded variant).</span>
</div>
<div class="form-row" style="margin:0">
<label class="form-label">Weight variant</label>
<label class="form-label">Weight variant <span class="muted">(download fallback)</span></label>
<select id="s-ds4-variant" class="form-input">
<option value="q2-imatrix">q2-imatrix (96/128 GB)</option>
<option value="q2-q4-imatrix">q2-q4-imatrix (96/128 GB)</option>
<option value="q4-imatrix">q4-imatrix (256 GB+)</option>
<option value="pro-q2-imatrix">pro-q2-imatrix (512 GB)</option>
</select>
<span class="form-hint">Only used when auto-download is enabled below.</span>
</div>
</div>
<div class="form-row">
<label style="display:flex;align-items:center;gap:.5rem;cursor:pointer">
<input type="checkbox" id="s-ds4-auto-download">
<span>Auto-download the weight variant <span class="muted">— OFF by default; when off, ds4 only serves deepseek4 GGUFs you already have and errors if none resolves (never pulls tens of GB silently)</span></span>
</label>
</div>
<div class="form-row">
<label class="form-label">Model GGUF path <span class="muted">(optional override)</span></label>
<input type="text" id="s-ds4-model-path" class="form-input" placeholder="/AI/guffcache/…-ds4-Q2_K.gguf">
<span class="form-hint">Force ds4-server to load this exact GGUF (skips the variant download). Leave blank to auto-serve the requested deepseek4 model.</span>
</div>
<div class="form-row">
<label style="display:flex;align-items:center;gap:.5rem;cursor:pointer">
<input type="checkbox" id="s-ds4-ssd-streaming">
<span>SSD streaming <span class="muted">— stream MoE experts from disk/SSD (run a 100GB+ model on a small GPU + modest RAM; slower)</span></span>
</label>
</div>
<div style="display:grid;grid-template-columns:1fr 1fr;gap:1rem;align-items:start">
<div class="form-row" style="margin:0">
<label class="form-label">Build target</label>
@@ -656,6 +674,9 @@ async function loadSettings(){
document.getElementById('s-ds4-enabled').checked = !!ds4.enabled;
document.getElementById('s-ds4-model-id').value = ds4.model_id ?? 'deepseek-v4';
document.getElementById('s-ds4-variant').value = ds4.model_variant ?? 'q4-imatrix';
document.getElementById('s-ds4-model-path').value = ds4.model_path ?? '';
document.getElementById('s-ds4-auto-download').checked = !!ds4.auto_download;
document.getElementById('s-ds4-ssd-streaming').checked = !!ds4.ssd_streaming;
document.getElementById('s-ds4-build-target').value = ds4.build_target ?? 'auto';
document.getElementById('s-ds4-install-dir').value = ds4.install_dir ?? '';
document.getElementById('s-ds4-auto-build').checked = ds4.auto_build !== false;
@@ -728,6 +749,9 @@ async function saveSettings(){
enabled: document.getElementById('s-ds4-enabled').checked,
model_id: document.getElementById('s-ds4-model-id').value.trim() || 'deepseek-v4',
model_variant: document.getElementById('s-ds4-variant').value,
model_path: document.getElementById('s-ds4-model-path').value.trim(),
auto_download: document.getElementById('s-ds4-auto-download').checked,
ssd_streaming: document.getElementById('s-ds4-ssd-streaming').checked,
build_target: document.getElementById('s-ds4-build-target').value,
install_dir: document.getElementById('s-ds4-install-dir').value.trim(),
auto_build: document.getElementById('s-ds4-auto-build').checked,
......
@@ -35,6 +35,7 @@ import time
import collections
from pathlib import Path
from typing import Optional
_lock = threading.RLock()
# Single managed server (ds4 serves one DeepSeek V4 model). Keyed by model_id so a
@@ -195,22 +196,56 @@ def _health_ok(url: str) -> bool:
return False
def ensure_service(cfg, ready_timeout: float = 3600.0) -> str:
"""Build + download (as needed), then start (or reuse) ds4-server.
def resolve_service_key(cfg, model_file: Optional[str] = None):
"""Decide which GGUF ds4-server should serve and the key to cache it under.
Returns the base URL. First call clones, builds, and downloads several GB, so the
timeout is generous. Raises RuntimeError if the service never becomes ready.
Preference: the requested model's own ``.gguf`` path → an explicit
``cfg.model_path`` override → '' (download the variant as a last resort).
Returns ``(resolved_gguf_or_'', svc_key)``; the key is the file when we have
one (so different deepseek4 models get their own server), else ``model_id``.
"""
resolved = ""
for cand in (model_file, getattr(cfg, "model_path", "") or ""):
cand = os.path.expanduser((cand or "").strip())
if cand and cand.lower().endswith(".gguf") and os.path.isfile(cand):
resolved = cand
break
svc_key = resolved or (getattr(cfg, "model_id", "deepseek-v4") or "deepseek-v4")
return resolved, svc_key
def ensure_service(cfg, model_file: Optional[str] = None,
ctx: Optional[int] = None,
ready_timeout: float = 3600.0) -> str:
"""Build (as needed), then start (or reuse) ds4-server serving the right GGUF.
``model_file`` is the requested model's path; when it resolves to a local
``.gguf`` (or ``cfg.model_path`` is set) ds4-server loads THAT via ``-m`` and
no weights are downloaded. Only when neither resolves does it fall back to
downloading ``cfg.model_variant``. ``ctx`` overrides the ds4 global context
(so the per-model n_ctx configuration wins). Returns the base URL.
"""
model_id = getattr(cfg, "model_id", "deepseek-v4") or "deepseek-v4"
resolved, svc_key = resolve_service_key(cfg, model_file)
with _lock:
svc = _services.get(model_id)
svc = _services.get(svc_key)
if svc and svc["proc"].poll() is None and _health_ok(svc["url"]):
return svc["url"]
if svc and svc["proc"].poll() is not None:
_services.pop(model_id, None) # died — restart below
_services.pop(svc_key, None) # died — restart below
binary = ensure_built(cfg)
ensure_model(cfg)
if not resolved:
# No local deepseek4 GGUF resolved. Downloading ds4's own variant is
# OPT-IN (auto_download, off by default) — otherwise fail with a clear
# message instead of silently pulling tens of GB.
if bool(getattr(cfg, "auto_download", False)):
ensure_model(cfg)
else:
raise RuntimeError(
"ds4: no local deepseek4 GGUF resolved for this request and "
"auto-download is disabled. Point the model at a deepseek4 "
".gguf (or set ds4.model_path), or enable ds4.auto_download to "
"fetch the configured weight variant.")
install_dir = _install_dir(cfg)
host = getattr(cfg, "host", "127.0.0.1") or "127.0.0.1"
@@ -220,21 +255,36 @@ def ensure_service(cfg, ready_timeout: float = 3600.0) -> str:
connect_host = "127.0.0.1" if host in ("0.0.0.0", "::") else host
url = f"http://{connect_host}:{port}"
# Per-model n_ctx (passed in) wins over the ds4 global ctx setting.
try:
ctx_val = int(ctx) if ctx else 0
except (TypeError, ValueError):
ctx_val = 0
if ctx_val <= 0:
ctx_val = int(getattr(cfg, "ctx", 100000) or 100000)
cmd = [str(binary), "--host", host, "--port", str(port),
"--ctx", str(int(getattr(cfg, "ctx", 100000) or 100000)),
"--ctx", str(ctx_val),
"--chdir", str(install_dir)]
if resolved:
cmd += ["-m", resolved]
if bool(getattr(cfg, "ssd_streaming", False)):
# Stream MoE experts from SSD/disk instead of full residency — lets a
# 100GB+ model run on a small GPU + modest RAM (slow but works).
cmd += ["--ssd-streaming"]
extra = (getattr(cfg, "extra_args", "") or "").strip()
if extra:
import shlex
cmd += shlex.split(extra)
print(f"[ds4] launching ds4-server: {' '.join(cmd)}", flush=True)
proc = subprocess.Popen(
cmd, cwd=str(install_dir), stdout=subprocess.PIPE,
stderr=subprocess.STDOUT, text=True, bufsize=1,
)
tail = collections.deque(maxlen=15)
threading.Thread(target=_pump_logs, args=(proc, tail), daemon=True).start()
_services[model_id] = {"proc": proc, "port": port, "url": url}
_services[svc_key] = {"proc": proc, "port": port, "url": url}
def _tail_msg():
joined = " | ".join(list(tail)[-5:]).strip()
@@ -247,11 +297,11 @@ def ensure_service(cfg, ready_timeout: float = 3600.0) -> str:
f"ds4-server exited (code {proc.returncode}) before becoming ready"
+ _tail_msg())
if _health_ok(url):
print(f"[ds4] service ready for {model_id} at {url}", flush=True)
print(f"[ds4] service ready for {svc_key} at {url}", flush=True)
return url
time.sleep(2)
stop_service(model_id)
raise RuntimeError(f"ds4-server for {model_id} did not become ready in time"
stop_service(svc_key)
raise RuntimeError(f"ds4-server for {svc_key} did not become ready in time"
+ _tail_msg())
......
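The command assembly in `ensure_service` can be isolated as a small pure function, which makes the context precedence easier to check. This is a sketch under assumptions: `build_cmd` and its parameters are hypothetical names; only the flags (`--host`, `--port`, `--ctx`, `--chdir`, `-m`, `--ssd-streaming`) come from the diff above.

```python
import shlex
from typing import List, Optional


def build_cmd(binary: str, host: str, port: int, install_dir: str, *,
              global_ctx: int = 100000, model_ctx: Optional[int] = None,
              model_file: str = "", ssd_streaming: bool = False,
              extra_args: str = "") -> List[str]:
    """Assemble a ds4-server argv list (illustrative only)."""
    # A positive per-model context wins; anything else falls back to the
    # global ds4 setting.
    try:
        ctx = int(model_ctx) if model_ctx else 0
    except (TypeError, ValueError):
        ctx = 0
    if ctx <= 0:
        ctx = global_ctx
    cmd = [binary, "--host", host, "--port", str(port),
           "--ctx", str(ctx), "--chdir", install_dir]
    if model_file:
        cmd += ["-m", model_file]       # serve the requested GGUF
    if ssd_streaming:
        cmd.append("--ssd-streaming")   # stream MoE experts from disk
    if extra_args.strip():
        cmd += shlex.split(extra_args)  # user-supplied extra flags
    return cmd
```

Keeping this step free of side effects means the flag logic can be unit-tested without building or launching ds4-server.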
@@ -35,6 +35,7 @@ class Ds4Backend(ModelBackend):
cfg = Ds4Config()
self._cfg = cfg
self._model_id = getattr(cfg, "model_id", "deepseek-v4") or "deepseek-v4"
self._svc_key: Optional[str] = None # ds4_worker service key (file or model_id)
self._url: Optional[str] = None
self._ctx = int(getattr(cfg, "ctx", 100000) or 100000)
self._last_usage: Dict = {}
@@ -46,7 +47,42 @@ class Ds4Backend(ModelBackend):
from codai.api import ds4_worker
if model_name:
self._model_id = model_name
self._url = ds4_worker.ensure_service(self._cfg)
# Honour the model's configured context window (n_ctx / ctx from its
# models.json entry, forwarded by the manager) over the ds4 global ctx.
_ctx = kwargs.get("n_ctx", kwargs.get("ctx"))
if isinstance(_ctx, (list, tuple)):
_ctx = _ctx[0] if _ctx else None
try:
_ctx = int(_ctx) if _ctx else 0
except (TypeError, ValueError):
_ctx = 0
if _ctx > 0:
self._ctx = _ctx
# Resolve the requested model to a concrete .gguf so ds4-server serves THE
# deepseek4 model that was asked for (not a fixed downloaded variant).
model_file = self._resolve_gguf(model_name)
_resolved, self._svc_key = ds4_worker.resolve_service_key(self._cfg, model_file)
self._url = ds4_worker.ensure_service(
self._cfg, model_file=model_file, ctx=(self._ctx or None))
@staticmethod
def _resolve_gguf(model_name: str):
"""Map a requested model name/path to a local .gguf path, if one exists."""
import os
if not model_name:
return None
cand = os.path.expanduser(model_name)
if cand.lower().endswith(".gguf") and os.path.isfile(cand):
return cand
# Bare filename / alias → look it up in the GGUF cache.
try:
from codai.models.cache import get_cached_model_path
p = get_cached_model_path(model_name)
if p and str(p).lower().endswith(".gguf") and os.path.isfile(p):
return str(p)
except Exception:
pass
return None
def get_model_name(self) -> str:
return self._model_id
@@ -59,7 +95,8 @@ class Ds4Backend(ModelBackend):
def cleanup(self) -> None:
from codai.api import ds4_worker
ds4_worker.stop_service(getattr(self._cfg, "model_id", self._model_id))
key = getattr(self, "_svc_key", None) or getattr(self._cfg, "model_id", self._model_id)
ds4_worker.stop_service(key)
self._url = None
# ------------------------------------------------------------------ #
......
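The weight-selection order shared by `Ds4Backend` and `ds4_worker` (requested file, then `model_path`, then an opt-in download) can be summarised in one function. This is a hedged sketch: `choose_weights` and its return shape are hypothetical and only mirror the behaviour described in this commit.

```python
import os
from typing import Optional, Tuple


def choose_weights(requested: Optional[str], model_path: str = "", *,
                   auto_download: bool = False,
                   variant: str = "q4-imatrix") -> Tuple[str, str]:
    """Return ("serve", path) or ("download", variant); illustrative only."""
    # Prefer the requested model's own file, then the explicit override.
    for cand in (requested, model_path):
        cand = os.path.expanduser((cand or "").strip())
        if cand.lower().endswith(".gguf") and os.path.isfile(cand):
            return ("serve", cand)
    # Downloading is opt-in; otherwise fail loudly rather than pull weights.
    if auto_download:
        return ("download", variant)
    raise RuntimeError(
        "no local deepseek4 GGUF resolved and auto_download is disabled")
```

With the default `auto_download=False`, a request that resolves to no local file raises immediately, which matches the "error clearly instead of silently pulling tens of GB" intent.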
@@ -229,11 +229,18 @@ class Ds4Config:
repo_url: str = "https://github.com/antirez/ds4"
install_dir: Optional[str] = None # None = ~/.coderai/ds4
build_target: str = "auto" # auto|cuda-generic|cuda-spark|metal|cpu
model_variant: str = "q4-imatrix" # download_model.sh variant
# The model ds4-server loads. Preferred: serve a deepseek4 GGUF the user
# already has — the requested model's own path is used when it resolves to a
# local .gguf, else `model_path` (an explicit override), else the variant is
# downloaded as a last resort. So you normally DON'T set model_variant at all.
model_path: str = "" # explicit GGUF for ds4-server -m (overrides the download)
auto_download: bool = False # OFF by default: only download a variant when explicitly opted in
model_variant: str = "q4-imatrix" # download_model.sh variant (used only when auto_download is on)
model_id: str = "deepseek-v4" # model id/alias that routes to ds4
host: str = "127.0.0.1"
port: int = 0 # 0 = auto-pick a free port
ctx: int = 100000 # ds4-server --ctx context window
ssd_streaming: bool = False # ds4-server --ssd-streaming: stream experts from SSD/disk
extra_args: str = "" # extra flags passed to ds4-server
auto_build: bool = True # clone+build the binary if it's missing
@@ -579,11 +586,14 @@ class ConfigManager:
"repo_url": self.config.ds4.repo_url,
"install_dir": self.config.ds4.install_dir,
"build_target": self.config.ds4.build_target,
"model_path": self.config.ds4.model_path,
"auto_download": self.config.ds4.auto_download,
"model_variant": self.config.ds4.model_variant,
"model_id": self.config.ds4.model_id,
"host": self.config.ds4.host,
"port": self.config.ds4.port,
"ctx": self.config.ds4.ctx,
"ssd_streaming": self.config.ds4.ssd_streaming,
"extra_args": self.config.ds4.extra_args,
"auto_build": self.config.ds4.auto_build,
},
......
@@ -47,11 +47,85 @@ def get_active_ds4_config():
return None
_GGUF_ARCH_CACHE: Dict[tuple, str] = {}
def _resolve_local_gguf(model_name: str):
"""Map a model name/alias/path to a local .gguf file path, or None."""
if not model_name:
return None
cand = os.path.expanduser(model_name)
if cand.lower().endswith(".gguf") and os.path.isfile(cand):
return cand
try:
p = get_cached_model_path(model_name)
if p and str(p).lower().endswith(".gguf") and os.path.isfile(str(p)):
return str(p)
except Exception:
pass
return None
def _gguf_architecture(path: str):
"""Read ``general.architecture`` from a GGUF header. Cached by (path,mtime,size)."""
import struct
try:
st = os.stat(path)
key = (path, st.st_mtime_ns, st.st_size)
except OSError:
return None
if key in _GGUF_ARCH_CACHE:
return _GGUF_ARCH_CACHE[key] or None
arch = ""
_sz = {0: 1, 1: 1, 7: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 10: 8, 11: 8, 12: 8}
try:
with open(path, "rb") as f:
if f.read(4) != b"GGUF":
_GGUF_ARCH_CACHE[key] = ""
return None
f.read(4); f.read(8) # version, tensor_count
kv_count = struct.unpack("<Q", f.read(8))[0]
def _rs(fh):
n = struct.unpack("<Q", fh.read(8))[0]
return fh.read(n)
for _ in range(kv_count):
k = _rs(f)
vtype = struct.unpack("<I", f.read(4))[0]
if vtype == 8: # string
v = _rs(f)
if k == b"general.architecture":
arch = v.decode("utf-8", "ignore")
break
elif vtype in _sz:
f.read(_sz[vtype])
elif vtype == 9: # array — skip its elements
atype = struct.unpack("<I", f.read(4))[0]
alen = struct.unpack("<Q", f.read(8))[0]
if atype == 8:
for _ in range(alen):
_rs(f)
elif atype in _sz:
f.read(_sz[atype] * alen)
else:
break # unknown element type — stop
else:
break # unknown value type — stop
except Exception:
arch = ""
_GGUF_ARCH_CACHE[key] = arch
return arch or None
def ds4_should_handle(model_name: str) -> bool:
"""True when ds4 is enabled and ``model_name`` should be served by ds4-server.
"""True when ds4 is enabled and ``model_name`` is a DeepSeek-V4 (ds4) model.
Matches the configured ``model_id`` (case-insensitive, short-name aware) or any
name containing ``deepseek-v4``, so the stock alias works without extra config.
Routing is by the GGUF ARCHITECTURE, not the filename: ds4 serves only genuine
``deepseek4`` GGUFs (its own format). Mainline DeepSeek GGUFs (deepseek/
deepseek2/deepseek3/deepseek32) are left to llama.cpp. The configured
``model_id`` alias still routes (covers the variant ds4 downloads itself, which
has no local file yet).
"""
if not model_name:
return False
@@ -63,7 +137,13 @@ def ds4_should_handle(model_name: str) -> bool:
mid = (getattr(cfg, "model_id", "") or "").lower()
if mid and (name == mid or short == mid):
return True
return "deepseek-v4" in name
# Definitive: read the GGUF architecture for a local file — only deepseek4.
path = _resolve_local_gguf(model_name)
if path:
return (_gguf_architecture(path) or "").lower() == "deepseek4"
# No local file (HF id / not downloaded yet): conservative name check that
# matches ONLY the V4 marker, so mainline deepseek GGUFs aren't grabbed.
return "deepseek-v4" in name or "deepseek4" in name
def _trim_cpu_ram() -> None:
......
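To exercise the header reader without a multi-gigabyte model, one option is to write a tiny GGUF-shaped file containing only metadata and read it back. The sketch below assumes the same header layout that `_gguf_architecture` parses (magic, version, tensor count, KV count, then key/type/value records); `write_minimal_gguf` and `read_arch` are hypothetical helpers, and `read_arch` is deliberately simplified (it gives up on array values instead of skipping them).

```python
import os
import struct
import tempfile

# Byte sizes of the fixed-width GGUF value types (uint8 .. float64, bool).
SCALAR_SIZES = {0: 1, 1: 1, 7: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4,
                10: 8, 11: 8, 12: 8}


def _gguf_string(raw: bytes) -> bytes:
    """Encode a GGUF string: uint64 length followed by the bytes."""
    return struct.pack("<Q", len(raw)) + raw


def write_minimal_gguf(path: str, arch: str) -> None:
    """Write a metadata-only GGUF-like file with two key/value pairs."""
    with open(path, "wb") as f:
        f.write(b"GGUF")
        f.write(struct.pack("<I", 3))   # version
        f.write(struct.pack("<Q", 0))   # tensor_count
        f.write(struct.pack("<Q", 2))   # metadata kv_count
        # A uint32 entry first, so the reader must skip a non-string value.
        f.write(_gguf_string(b"general.alignment"))
        f.write(struct.pack("<I", 4))   # value type 4 = uint32
        f.write(struct.pack("<I", 32))
        f.write(_gguf_string(b"general.architecture"))
        f.write(struct.pack("<I", 8))   # value type 8 = string
        f.write(_gguf_string(arch.encode("utf-8")))


def read_arch(path: str):
    """Return general.architecture, or None if it cannot be read."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            return None
        f.read(4 + 8)                   # skip version and tensor_count
        (kv_count,) = struct.unpack("<Q", f.read(8))
        for _ in range(kv_count):
            (klen,) = struct.unpack("<Q", f.read(8))
            key = f.read(klen)
            (vtype,) = struct.unpack("<I", f.read(4))
            if vtype == 8:
                (vlen,) = struct.unpack("<Q", f.read(8))
                value = f.read(vlen)
                if key == b"general.architecture":
                    return value.decode("utf-8", "ignore")
            elif vtype in SCALAR_SIZES:
                f.read(SCALAR_SIZES[vtype])
            else:
                return None             # arrays/unknown types: not handled here
    return None


if __name__ == "__main__":
    demo = os.path.join(tempfile.mkdtemp(), "demo.gguf")
    write_minimal_gguf(demo, "deepseek4")
    print(read_arch(demo))
```

A fixture like this could back a unit test for the routing check, since only the header bytes matter for the architecture lookup.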