Merge feat/township-match-upload: to-download list, mmproj vision, styled modals, broker + packaging

766fef3c · Stefy Lanza (nextime / spora ) · 56291911 · cbf7f147 · 766fef3c · 766fef3c
Commit 766fef3c authored Jun 19, 2026 by Stefy Lanza (nextime / spora )
73 changed files
--- a/.dockerignore
+++ b/.dockerignore
@@ -21,6 +21,18 @@ township_output
 dist
 dist-package
 *.log
+tmp
+debug.log
+CoderAI.gif
+
+# Produced artifacts and tool session/output dirs (mounted as volumes at runtime,
+# never baked into the image)
+video_editor/sessions
+video_editor.config.json
+tools/videogen_output
+tools/township_output
+tools/coderai_media
+samples

 # Build outputs
 build

--- a/.gitignore
+++ b/.gitignore
@@ -17,6 +17,15 @@ __pycache__/

 # Debug logs
 debug.log
+/logs/
+
+# Runtime model cache (downloads, self-quantized checkpoints, job state).
+# Root-anchored so it never shadows the tracked codai/models/ source package.
+/models/
+
+# Third-party source clone of the GPTQ quantizer — installed into the venv from
+# source; the working tree is not part of this repo (it has its own .git).
+/GPTQModel/

 # Test files
 test_*.py
@@ -33,3 +42,11 @@ township_output/
 # Packaging build cache + runtime temp (large artifacts)
 .packaging-cache/
 tmp/
+
+# Exported image tarballs + local OCI run-state (large artifacts)
+dist/
+coderai-runtime/
+
+# Video editor sessions + generated media (runtime artifacts)
+video_editor/sessions/
+tools/coderai_media/
--- a/AI.PROMPT
+++ b/AI.PROMPT
@@ -286,3 +286,67 @@ safe.
 14. Thermal protection is config-driven and model-agnostic (config.json
    `thermal`). Don't special-case it per model/backend; it only reads temps and
    sleeps. Honour the enable flags and high/resume hysteresis.
+
+================================================================================
+## Distributable Docker image (packaging/linux)
+================================================================================
+
+All-in-one image: coderai + tools (editor/videogen/township) behind nginx on a
+single port (8776), built from the LOCAL install's venv + binaries.
+
+Multi-stage `Dockerfile.oci-venv`:
+  - assembler stage stages the local bundle into /opt/coderai (python-build-
+    standalone interpreter + venv site-packages + ldd'd native libs + parler
+    overlay + lip-sync venv/repos + py310 + ds4). The ~20 GB bundle COPY lives
+    ONLY here; the runtime stage COPYs the assembled tree ONCE (no double-store).
+  - runtime stage: apt (nginx/supervisor/vulkan-tools/ffmpeg/...), COPY the
+    assembled /opt/coderai, then COPY app code → /opt/coderai/app, launchers →
+    /usr/local/bin, nginx/supervisor confs. Entry = coderai-entrypoint →
+    supervisord (nginx + main server + tool UIs).
+  - Do NOT set PYTHONHOME globally (breaks the system-python supervisord); set
+    PATH only. Bundle dereferences host symlinks (cp -aL) so binaries like
+    whisper-server are real files in the image, not dangling links.
+
+Full build (slow, ~15 min — rebuilds the bundle):
+  packaging/linux/build_oci_image.sh                      # tags coderai:dist
+Smoke test (no weights, checks services + every bundled binary):
+  DOCKER="sudo docker" GPU="--gpus all" PORT=18082 \
+    packaging/linux/smoke_test_services.sh coderai:dist
+
+Run against your LIVE local config + data (no rebuild — pure bind-mounts):
+  packaging/linux/run_oci.sh --nvidia --local \
+    --map /AI/guffcache --map /AI/huggingface --map /AI/offloads
+  - The image launcher reads config from /config/coderai and runs
+    `coderai --config /config/coderai`, rewriting server.host/port in config.json.
+  - `--local` (= --config-dir ~/.coderai) copies ONLY the *.json config files to
+    a temp dir and mounts it at /config/coderai, so your real config is untouched
+    (use --inplace-config to edit it directly).
+  - `--map HOST[:CONT]` bind-mounts a host dir at the SAME path inside the
+    container so the ABSOLUTE paths in models.json/config.json (gguf/hf caches,
+    offloads) resolve unchanged. Without these maps the models won't be found.
+  - `--debug[=SPEC]` runs coderai with --debug* flags (SPEC default 'all';
+    e.g. `--debug=engine,requests,ws` → --debug-engine/--debug-requests/--debug-ws,
+    `--debug` always auto-added) and writes a host-tailable file log. `--log-file
+    PATH` sets the in-container log path (default /cache/logs/coderai.log → host
+    under the cache mount). Driven by env CODERAI_DEBUG + CODERAI_LOG_FILE, read
+    by the coderai-oci launcher, which tees output so `docker logs` still works.
+    supervisord [program:coderai] uses stopasgroup/killasgroup so the front's
+    engine subprocesses + the tee are torn down together. NOTE: the launcher +
+    supervisord.conf are baked in, so changes need a (fast) update_oci_image.sh.
+
+Incremental update (FAST, ~30 s — code-only changes, NO bundle recopy):
+  DOCKER="sudo docker" packaging/linux/update_oci_image.sh
+  - `Dockerfile.update` is `FROM coderai:base` and re-layers ONLY the app code +
+    launchers + service confs. The heavy bundle layers are inherited unchanged.
+  - Keeps an immutable `coderai:base` (the bundle) and rebuilds `coderai:dist`
+    as base + a thin app layer. Every update starts from the SAME base, so app
+    layers never stack across updates. dist and base SHARE the bundle layers —
+    keeping both costs only the app layer (a few MB), not a second 23 GB.
+  - First run seeds coderai:base from the current coderai:dist (docker tag).
+  - Re-baseline the bundle (new venv/libs/tools): run build_oci_image.sh, then
+    `docker rmi coderai:base` so the next update re-seeds it from the new dist.
+  - Use this whenever ONLY codai/ app code (or launchers/confs) changed — a full
+    build_oci_image.sh is wasteful for that.
+  - CAUTION: COPY adds/overwrites but does NOT delete files removed from the
+    repo; the cleanup RUN prunes only known-stale paths (.git/venv*/dist/...). A
+    source file deleted from codai/ lingers in the overlay until a full rebuild.
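
Note on the `--map HOST[:CONT]` option documented in the AI.PROMPT hunk above: the following is a minimal Python sketch of how such a spec could expand into `docker run -v` arguments. The helper name `map_to_volume_args` is hypothetical, and the rule that an omitted `:CONT` reuses the host path is an assumption drawn from the description ("at the SAME path inside the container"). The real `run_oci.sh` is a shell script whose parsing is not shown in this diff and may differ.

```python
def map_to_volume_args(specs):
    """Expand --map style specs into `docker run -v HOST:CONT` argument pairs.

    Assumption (not confirmed by the diff): a spec without ':CONT' mounts the
    host directory at the identical path inside the container.
    """
    args = []
    for spec in specs:
        host, sep, cont = spec.partition(":")
        if not sep or not cont:
            cont = host  # same absolute path on both sides
        args += ["-v", f"{host}:{cont}"]
    return args


if __name__ == "__main__":
    # The three maps used in the documented run_oci.sh invocation.
    print(map_to_volume_args(["/AI/guffcache", "/AI/huggingface", "/AI/offloads"]))
```

Mounting at the identical path is what lets the absolute paths stored in models.json/config.json resolve unchanged inside the container.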
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@

 ![CoderAI](CoderAI.gif)

-An OpenAI-compatible API server to run models on your local GPU with web administration dashboard, supporting multiple GPU backends: NVIDIA (CUDA), AMD (Vulkan), and Intel (Vulkan). Configuration-driven architecture with per-model settings and full multi-modal support.
+A multimodal and multi-backend local model orchestrator with an OpenAI-compatible API server to run models on local GPUs, supporting multiple GPU backends: NVIDIA (CUDA), AMD (Vulkan), and Intel (Vulkan). Configuration-driven architecture with per-model settings and full multi-modal support.

 ## Features


--- a/build.sh
+++ b/build.sh
@@ -35,12 +35,13 @@ BACKEND="${1:-all}"
 FLASH=false
 CUSTOM_VENV=""
 PACKAGE=false
+DS4=false

 # Parse arguments
 i=1
 for arg in "$@"; do
    case $arg in
-        --flash) 
+        --flash)
            FLASH=true
            ;;
        --venv)
@@ -50,6 +51,9 @@ for arg in "$@"; do
        --package)
            PACKAGE=true
            ;;
+        --ds4)
+            DS4=true
+            ;;
    esac
    i=$((i + 1))
 done
@@ -68,6 +72,7 @@ if [[ "$BACKEND" != "nvidia" && "$BACKEND" != "vulkan" && "$BACKEND" != "vulkan-
    echo ""
    echo "Options:"
    echo "  --flash     - Install Flash Attention 2 for faster inference (NVIDIA only)"
+    echo "  --ds4       - Clone + build the ds4 (DeepSeek V4) native engine"
    exit 1
 fi

@@ -755,6 +760,35 @@ package_app() {
    echo -e "${YELLOW}Note: The target machine must still provide compatible system GPU/runtime libraries.${NC}"
 }

+# Optionally clone + build ds4 (DeepSeek V4 native engine). Opt-in via --ds4.
+# coderai can also auto-build this at runtime on first use, but doing it here lets
+# the OCI/Docker packaging bundle the prebuilt ds4-server binary.
+build_ds4() {
+    local DS4_DIR="${CODERAI_DS4_DIR:-$HOME/.coderai/ds4}"
+    echo -e "${YELLOW}Building ds4 (DeepSeek V4 engine) → $DS4_DIR ...${NC}"
+    if [ ! -e "$DS4_DIR/Makefile" ]; then
+        mkdir -p "$(dirname "$DS4_DIR")"
+        git clone --depth 1 https://github.com/antirez/ds4 "$DS4_DIR" || {
+            echo -e "${YELLOW}Warning: could not clone ds4; skipping.${NC}"; return 0; }
+    fi
+    local TARGET="cpu"
+    if command -v nvcc &> /dev/null || [ -d "/usr/local/cuda" ]; then
+        TARGET="cuda-generic"
+    elif [ "$(uname -s)" = "Darwin" ]; then
+        TARGET=""   # bare `make` builds the macOS Metal backend
+    fi
+    ( cd "$DS4_DIR" && make $TARGET ) || {
+        echo -e "${YELLOW}Warning: ds4 build failed; it can still be built at runtime.${NC}"; return 0; }
+    if [ -x "$DS4_DIR/ds4-server" ]; then
+        echo -e "${GREEN}✓ ds4-server built at $DS4_DIR/ds4-server${NC}"
+        echo -e "${YELLOW}Note: DeepSeek V4 weights are downloaded on first use (multi-GB).${NC}"
+    fi
+}
+
+if [ "$DS4" = true ]; then
+    build_ds4
+fi
+
 # Create .backend file to track which backend was used
 echo "$BACKEND" > .backend


--- a/codai/admin/routes.py
+++ b/codai/admin/routes.py
--- a/codai/admin/templates/archive.html
+++ b/codai/admin/templates/archive.html
@@ -335,7 +335,7 @@ async function deleteEntry() {
    closeDetail();
    loadArchive();
  } catch(e) {
-    alert('Delete failed: ' + e.message);
+    showAlert('Delete failed: ' + e.message);
  }
 }


--- a/codai/admin/templates/base.html
+++ b/codai/admin/templates/base.html
@@ -104,6 +104,81 @@ function donateCopy(id, btn) {
 </main>
 {% endif %}

+<!-- Shared confirm / notice modal (replaces window.confirm / window.alert) -->
+<div id="confirm-modal" class="modal" onclick="if(event.target===this)document.getElementById('confirm-modal-cancel').click()">
+  <div class="modal-box" style="max-width:420px">
+    <div class="modal-head">
+      <span class="modal-title" id="confirm-modal-title">Confirm</span>
+      <button class="modal-close" id="confirm-modal-x">&times;</button>
+    </div>
+    <div class="modal-body">
+      <p id="confirm-modal-msg" style="margin:0 0 1.25rem;white-space:pre-wrap"></p>
+      <div style="display:flex;gap:.5rem;justify-content:flex-end">
+        <button class="btn btn-ghost" id="confirm-modal-cancel">Cancel</button>
+        <button class="btn btn-danger" id="confirm-modal-ok">Confirm</button>
+      </div>
+    </div>
+  </div>
+</div>
+<script>
+// Global modal helpers, shared by every admin page. Defined here so templates
+// can call showAlert()/showConfirm() instead of window.alert()/window.confirm().
+if(typeof window.openModal!=='function') window.openModal=function(id){document.getElementById(id).classList.add('show')};
+if(typeof window.closeModal!=='function') window.closeModal=function(id){document.getElementById(id).classList.remove('show')};
+
+window.showConfirm=function(title, msg, okLabel){
+  return new Promise(resolve => {
+    document.getElementById('confirm-modal-title').textContent = title;
+    document.getElementById('confirm-modal-msg').textContent = msg;
+    const okBtn    = document.getElementById('confirm-modal-ok');
+    const cancelBtn= document.getElementById('confirm-modal-cancel');
+    const xBtn     = document.getElementById('confirm-modal-x');
+    okBtn.className = 'btn btn-danger';
+    okBtn.textContent = okLabel || 'Confirm';
+    cancelBtn.style.display = '';
+    openModal('confirm-modal');
+    function cleanup(result){
+      closeModal('confirm-modal');
+      okBtn.removeEventListener('click', onOk);
+      cancelBtn.removeEventListener('click', onCancel);
+      xBtn.removeEventListener('click', onCancel);
+      resolve(result);
+    }
+    function onOk(){ cleanup(true); }
+    function onCancel(){ cleanup(false); }
+    okBtn.addEventListener('click', onOk);
+    cancelBtn.addEventListener('click', onCancel);
+    xBtn.addEventListener('click', onCancel);
+  });
+};
+
+// Styled replacement for window.alert(): a single-button notice modal.
+window.showAlert=function(msg, title, kind){
+  return new Promise(resolve => {
+    if(!title && !kind && /^\s*(error|failed|cannot|could not)\b/i.test(String(msg||''))) kind = 'error';
+    document.getElementById('confirm-modal-title').textContent =
+      title || (kind === 'error' ? 'Error' : 'Notice');
+    document.getElementById('confirm-modal-msg').textContent = msg;
+    const okBtn     = document.getElementById('confirm-modal-ok');
+    const cancelBtn = document.getElementById('confirm-modal-cancel');
+    const xBtn      = document.getElementById('confirm-modal-x');
+    okBtn.className = 'btn btn-primary';
+    okBtn.textContent = 'OK';
+    cancelBtn.style.display = 'none';
+    openModal('confirm-modal');
+    function cleanup(){
+      closeModal('confirm-modal');
+      cancelBtn.style.display = '';
+      okBtn.removeEventListener('click', onOk);
+      xBtn.removeEventListener('click', onOk);
+      resolve();
+    }
+    function onOk(){ cleanup(); }
+    okBtn.addEventListener('click', onOk);
+    xBtn.addEventListener('click', onOk);
+  });
+};
+</script>
 {% block scripts %}{% endblock %}
 </body>
 </html>
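
Note on `showAlert()` in the base.html hunk above: when neither `title` nor `kind` is passed, the modal title is inferred from the start of the message. The sketch below is a Python transliteration of that JavaScript check, written only to illustrate which of the call sites in this commit end up titled "Error" and which fall back to "Notice". The function name `modal_title` is hypothetical, and the pattern is copied from the diff.

```python
import re

# Same pattern as the JavaScript regex in showAlert(), anchored at the start.
_ERROR_PREFIX = re.compile(r"^\s*(error|failed|cannot|could not)\b", re.IGNORECASE)


def modal_title(msg, title=None, kind=None):
    """Mirror how showAlert() chooses its title (illustrative only)."""
    if not title and not kind and _ERROR_PREFIX.search(str(msg or "")):
        kind = "error"
    return title or ("Error" if kind == "error" else "Notice")


if __name__ == "__main__":
    print(modal_title("Failed to load profile: timeout"))  # Error
    print(modal_title("Delete failed: timeout"))           # Notice
    print(modal_title('Saved profile "hero"'))             # Notice
```

Because the pattern only matches at the beginning of the message, strings such as `'Delete failed: ...'` and `'Save failed: ...'` are shown under the "Notice" title unless the caller passes `kind` or `title` explicitly.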
--- a/codai/admin/templates/chat.html
+++ b/codai/admin/templates/chat.html
@@ -2372,7 +2372,7 @@ const STUDIO_CAPABILITIES = {
    optional:[],
    notes:[
      'Requires <code>insightface</code> and <code>onnxruntime</code>: <code>pip install insightface onnxruntime</code>.',
-      'The <b>inswapper_128.onnx</b> model is <b>auto-downloaded</b> from HuggingFace on first use (<a href="/admin/models?tab=search&q=inswapper&pipeline=&gguf=no-gguf" class="cap-find-link">deepinsight/inswapper<span class="cap-find-icon">↗</span></a>).',
+      'The <b>inswapper_128.onnx</b> model is <b>auto-downloaded</b> from HuggingFace on first use (<a href="' + (window.ROOT_PATH||'') + '/admin/models?tab=search&q=inswapper&pipeline=&gguf=no-gguf" class="cap-find-link">deepinsight/inswapper<span class="cap-find-icon">↗</span></a>).',
      'No AI model selection needed — this feature uses its own dedicated backend.',
    ],
    backendPath: ROOT_PATH + '/v1/images/faceswap',
@@ -2386,7 +2386,7 @@ const STUDIO_CAPABILITIES = {
    optional:[],
    notes:[
      'Requires <code>insightface</code> and <code>onnxruntime</code>: <code>pip install insightface onnxruntime</code>.',
-      'The <b>inswapper_128.onnx</b> model is <b>auto-downloaded</b> from HuggingFace on first use (<a href="/admin/models?tab=search&q=inswapper&pipeline=&gguf=no-gguf" class="cap-find-link">deepinsight/inswapper<span class="cap-find-icon">↗</span></a>).',
+      'The <b>inswapper_128.onnx</b> model is <b>auto-downloaded</b> from HuggingFace on first use (<a href="' + (window.ROOT_PATH||'') + '/admin/models?tab=search&q=inswapper&pipeline=&gguf=no-gguf" class="cap-find-link">deepinsight/inswapper<span class="cap-find-icon">↗</span></a>).',
      'No AI model selection needed — this feature uses its own dedicated backend.',
    ],
    backendPath: ROOT_PATH + '/v1/images/faceswap',
@@ -2461,14 +2461,14 @@ function capSearchUrl(cap) {
  const s = CAP_TO_HF_SEARCH[cap];
  if (!s) return null;
  const p = new URLSearchParams({ tab:'search', q: s.q, pipeline: s.pipeline, gguf: s.gguf });
-  return '/admin/models?' + p.toString();
+  return (window.ROOT_PATH || '') + '/admin/models?' + p.toString();
 }
 function capMissingHtml(caps, label) {
  if (!caps.length) return '';
  const links = caps.map(cap => {
    const chip = `<span class="cap-chip dim">${cap.replace(/_/g,' ')}</span>`;
    if (_localCapSet.has(cap)) {
-      const url = `/admin/models?local_cap=${encodeURIComponent(cap)}`;
+      const url = `${window.ROOT_PATH || ''}/admin/models?local_cap=${encodeURIComponent(cap)}`;
      return `<a href="${url}" class="cap-find-link" title="You have a local model with ${cap.replace(/_/g,' ')} — click to configure it">${chip}<span class="cap-find-icon" style="color:#6ecf7e">↑ configure</span></a>`;
    }
    const url = capSearchUrl(cap);
@@ -4229,12 +4229,12 @@ async function loadCharProfileIntoSlot(prefix, idx, name) {
    charSlots[prefix][idx].name = charSlots[prefix][idx].name || d.name;
    charSlots[prefix][idx].images = (d.images||[]).map(img => img.data);
    renderCharSlots(prefix);
-  } catch(e) { alert('Failed to load profile: '+e.message); }
+  } catch(e) { showAlert('Failed to load profile: '+e.message); }
 }

 async function saveCharSlotAsProfile(prefix, idx) {
  const slot = charSlots[prefix]?.[idx];
-  if (!slot || !slot.images.length) { alert('Add at least one image first.'); return; }
+  if (!slot || !slot.images.length) { showAlert('Add at least one image first.'); return; }
  const name = slot.name || prompt('Profile name:');
  if (!name) return;
  try {
@@ -4246,8 +4246,8 @@ async function saveCharSlotAsProfile(prefix, idx) {
    charSlots[prefix][idx].name = name;
    await loadCharProfileList();
    renderCharSlots(prefix);
-    alert(`Saved profile "${name}"`);
-  } catch(e) { alert('Save failed: '+e.message); }
+    showAlert(`Saved profile "${name}"`);
+  } catch(e) { showAlert('Save failed: '+e.message); }
 }

 // ─────────────────────────────────────────────────────────────────
@@ -6051,14 +6051,14 @@ async function profCharView(name) {
  try {
    const d = await fetch(ROOT_PATH + '/admin/api/characters/'+encodeURIComponent(name)).then(r=>r.json());
    _openProfModal(`Character: ${d.name}`, d.description||'', d.images||[]);
-  } catch(e) { alert('Failed to load character: ' + e.message); }
+  } catch(e) { showAlert('Failed to load character: ' + e.message); }
 }

 async function profCharDelete(name) {
  if (!confirm(`Delete character profile "${name}"?`)) return;
  const r = await fetch(ROOT_PATH + '/admin/api/characters/'+encodeURIComponent(name), {method:'DELETE'});
  if (r.ok) await profCharLoad();
-  else alert('Delete failed: ' + await r.text());
+  else showAlert('Delete failed: ' + await r.text());
 }


@@ -6139,7 +6139,7 @@ async function profVoiceDelete(name) {
  if (!confirm(`Delete voice profile "${name}"?`)) return;
  const r = await fetch(ROOT_PATH + '/admin/api/voices/'+encodeURIComponent(name), {method:'DELETE'});
  if (r.ok) await profVoiceLoad();
-  else alert('Delete failed: ' + await r.text());
+  else showAlert('Delete failed: ' + await r.text());
 }

 // ─────────────────────────────────────────────────────────────────
@@ -6296,14 +6296,14 @@ async function profEnvView(name) {
  try {
    const d = await fetch(ROOT_PATH + '/admin/api/environments/'+encodeURIComponent(name)).then(r=>r.json());
    _openProfModal(`Environment: ${d.name}`, d.description||'', d.images||[]);
-  } catch(e) { alert('Failed to load environment: ' + e.message); }
+  } catch(e) { showAlert('Failed to load environment: ' + e.message); }
 }

 async function profEnvDelete(name) {
  if (!confirm(`Delete environment profile "${name}"?`)) return;
  const r = await fetch(ROOT_PATH + '/admin/api/environments/'+encodeURIComponent(name), {method:'DELETE'});
  if (r.ok) await profEnvLoad();
-  else alert('Delete failed: ' + await r.text());
+  else showAlert('Delete failed: ' + await r.text());
 }

 // ─────────────────────────────────────────────────────────────────
@@ -6528,7 +6528,7 @@ async function deleteCustomPipeline(id) {
    _customPipelines = _customPipelines.filter(p => p.id !== id);
    if (_editingPipelineId === id) { _editingPipelineId = null; _pbSteps = []; renderBuilderSteps(); }
    renderCustomPipelineCards();
-  } catch(e) { alert('Delete failed: '+e.message); }
+  } catch(e) { showAlert('Delete failed: '+e.message); }
 }

 function _renderPipelineResult(outId, progId, d) {
@@ -6683,7 +6683,7 @@ async function archiveDelete(filename) {
    _archiveFiles = _archiveFiles.filter(f => f.filename !== filename);
    renderArchive();
  } catch(e) {
-    alert('Delete failed: ' + e.message);
+    showAlert('Delete failed: ' + e.message);
  }
 }


--- a/codai/admin/templates/models.html
+++ b/codai/admin/templates/models.html
--- a/codai/admin/templates/settings.html
+++ b/codai/admin/templates/settings.html
--- a/codai/admin/templates/tasks.html
+++ b/codai/admin/templates/tasks.html
--- a/codai/admin/templates/tokens.html
+++ b/codai/admin/templates/tokens.html
@@ -126,15 +126,15 @@ async function createToken() {
      openModal('show-modal');
      loadTokens();
    } else {
-      const e = await r.json(); alert(e.detail || 'Failed');
+      const e = await r.json(); showAlert(e.detail || 'Failed');
    }
-  } catch (e) { alert(e.message); }
+  } catch (e) { showAlert(e.message); }
 }

 async function delToken(id) {
  if (!confirm('Delete this token? Clients using it will lose access immediately.')) return;
  const r = await fetch(ROOT_PATH + '/admin/api/tokens/'+id, {method:'DELETE'});
-  if (r.ok) loadTokens(); else alert('Failed to delete');
+  if (r.ok) loadTokens(); else showAlert('Failed to delete');
 }

 loadTokens();

--- a/codai/admin/templates/users.html
+++ b/codai/admin/templates/users.html
@@ -105,7 +105,7 @@ async function delUser(id, name) {
  if (!confirm('Delete user "' + name + '"?')) return;
  const r = await fetch(ROOT_PATH + '/admin/api/users/'+id, {method:'DELETE'});
  if (r.ok) location.reload();
-  else { const e = await r.json(); alert(e.detail || 'Failed'); }
+  else { const e = await r.json(); showAlert(e.detail || 'Failed'); }
 }
 </script>
 {% endblock %}
--- a/codai/api/app.py
+++ b/codai/api/app.py
@@ -160,6 +160,32 @@ except ImportError:
    pass


+class _InternalAuthMiddleware:
+    """Reject any HTTP request that doesn't carry the front's internal token.
+
+    Active only when CODERAI_INTERNAL_TOKEN is set (i.e. this process is an engine
+    spawned by the front). It binds 127.0.0.1, but this also blocks anything else on
+    localhost from talking to the engine directly and bypassing the front. In
+    single-process mode the token is unset and this is a no-op."""
+
+    def __init__(self, app):
+        self._app = app
+        self._token = os.environ.get("CODERAI_INTERNAL_TOKEN")
+
+    async def __call__(self, scope, receive, send):
+        if self._token and scope.get("type") == "http":
+            headers = dict(scope.get("headers", []))
+            got = headers.get(b"x-coderai-internal", b"").decode("latin-1")
+            if got != self._token:
+                await send({"type": "http.response.start", "status": 403,
+                            "headers": [(b"content-type", b"application/json")]})
+                await send({"type": "http.response.body",
+                            "body": b'{"error":"forbidden: engines are reachable only '
+                                    b'through the front proxy"}'})
+                return
+        await self._app(scope, receive, send)
+
+
 class _ForwardedPrefixMiddleware:
    """Populate ASGI root_path from X-Forwarded-Prefix / X-Script-Name headers."""

@@ -180,6 +206,9 @@ class _ForwardedPrefixMiddleware:


 app.add_middleware(_ForwardedPrefixMiddleware)
+# Added last → outermost: the internal-token gate runs before anything else, so a
+# request without the front's token never reaches a route.
+app.add_middleware(_InternalAuthMiddleware)

 # Mount static files for admin dashboard
 from fastapi.staticfiles import StaticFiles
@@ -193,6 +222,77 @@ from fastapi.responses import FileResponse, Response as _FaviconResponse
 _favicon_path = admin_static_dir / "favicon.ico"


+@app.get("/healthz", include_in_schema=False)
+async def healthz():
+    """Cheap liveness probe that touches no torch/model state.
+
+    The front proxy's engine supervisor polls this to distinguish a *slow* engine
+    (busy loading a model — the event loop may be blocked, so this can be late but
+    will eventually answer) from a *dead* one (connection refused). It must stay
+    trivial and dependency-free so it returns the instant the loop is free."""
+    import os as _os
+    return {"ok": True, "pid": _os.getpid()}
+
+
+@app.get("/internal/engine-state", include_in_schema=False)
+async def internal_engine_state():
+    """Auth-free engine introspection for the front proxy's router/aggregator.
+
+    Engines bind 127.0.0.1 only, so this is not publicly reachable. Returns which
+    models are resident (for model→engine routing) and this engine's GPU/VRAM (for
+    cross-engine status aggregation). Kept cheap so it answers even mid-generation.
+    """
+    import os as _os
+    try:
+        loaded = list(multi_model_manager.models.keys())
+    except Exception:
+        loaded = []
+    vram = None
+    try:
+        import torch
+        if torch.cuda.is_available():
+            # Sum across every CUDA device this engine can see — an engine may own
+            # more than one GPU (e.g. two NVIDIA cards sharding one large model), so
+            # reporting only device 0 would under-count its VRAM.
+            n = torch.cuda.device_count()
+            used = free = total = 0
+            devs = []
+            for i in range(n):
+                f, t = torch.cuda.mem_get_info(i)
+                used += (t - f); free += f; total += t
+                devs.append({"index": i, "name": torch.cuda.get_device_name(i),
+                             "free": round(f / 1e9, 2), "total": round(t / 1e9, 2)})
+            label = (torch.cuda.get_device_name(0) if n == 1
+                     else f"{n}× CUDA")
+            vram = {"used": round(used / 1e9, 2), "free": round(free / 1e9, 2),
+                    "total": round(total / 1e9, 2), "gpu": label,
+                    "devices": devs, "device_count": n}
+    except Exception:
+        vram = None
+    # Running tasks so the front can show cross-engine activity without needing a
+    # session on this engine (sessions live only on the primary).
+    tasks = []
+    try:
+        from codai.tasks import task_registry
+        tasks = [t for t in task_registry.list()
+                 if t.get("status") in ("running", "queued", "paused")]
+    except Exception:
+        tasks = []
+    # This engine's thermal cooldown state, so the front can show WHICH engine is
+    # cooling (each engine pauses on its own GPUs; CPU pauses everything).
+    cooling = None
+    try:
+        from codai.models import thermal
+        cs = thermal.get_cooldown_state()
+        if cs.get("active"):
+            cooling = {"gpu": cs.get("gpu"), "cpu": cs.get("cpu"),
+                       "message": cs.get("message")}
+    except Exception:
+        cooling = None
+    return {"ok": True, "pid": _os.getpid(), "loaded_models": loaded,
+            "vram": vram, "tasks": tasks, "cooling": cooling}
+
+
 @app.get("/favicon.ico", include_in_schema=False)
 async def favicon():
    if _favicon_path.exists():
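
Note on `_InternalAuthMiddleware` in the app.py hunks above: the gate can be exercised without FastAPI by driving it with a hand-built ASGI scope. The sketch below is a simplified stand-in, not the committed class. `TokenGate` takes the token as a constructor argument instead of reading `CODERAI_INTERNAL_TOKEN`, and `ok_app` and `status_for` are hypothetical test helpers.

```python
import asyncio


class TokenGate:
    """Illustrative ASGI gate modelled on _InternalAuthMiddleware."""

    def __init__(self, app, token):
        self._app = app
        self._token = token  # None/empty means the gate is a no-op

    async def __call__(self, scope, receive, send):
        if self._token and scope.get("type") == "http":
            headers = dict(scope.get("headers", []))
            got = headers.get(b"x-coderai-internal", b"").decode("latin-1")
            if got != self._token:
                await send({"type": "http.response.start", "status": 403,
                            "headers": [(b"content-type", b"application/json")]})
                await send({"type": "http.response.body",
                            "body": b'{"error":"forbidden"}'})
                return
        await self._app(scope, receive, send)


async def ok_app(scope, receive, send):
    """Downstream app that always answers 200."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


def status_for(token, headers):
    """Run one request through the gate and return the response status."""
    sent = []

    async def send(message):
        sent.append(message)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    scope = {"type": "http", "headers": headers}
    asyncio.run(TokenGate(ok_app, token)(scope, receive, send))
    return sent[0]["status"]


if __name__ == "__main__":
    print(status_for("s3cret", []))                                    # 403
    print(status_for("s3cret", [(b"x-coderai-internal", b"s3cret")]))  # 200
    print(status_for(None, []))                                        # 200
```

As in the committed class, only scopes of type `http` are checked, so other scope types pass through untouched.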

--- a/codai/api/ds4_worker.py
+++ b/codai/api/ds4_worker.py
--- a/codai/api/embeddings.py
+++ b/codai/api/embeddings.py
@@ -106,6 +106,27 @@ async def create_embeddings(request: EmbeddingsRequest, http_request: Request =
    """
    OpenAI-compatible embeddings endpoint.
    """
+    # Register a task so embeddings appear in the unified task list, like every
+    # other model type. Finished on success or error below.
+    from codai.tasks import task_registry
+    _title = request.input if isinstance(request.input, str) else "embeddings"
+    _tid = task_registry.register(
+        "embedding", title=str(_title)[:80], model=(request.model or "embedding"))
+    task_registry.start(_tid)
+    try:
+        _resp = await _run_embeddings(request, http_request)
+        task_registry.finish(_tid, "done")
+        return _resp
+    except HTTPException:
+        task_registry.finish(_tid, "error")
+        raise
+    except Exception as e:
+        task_registry.finish(_tid, "error", str(e)[:200])
+        raise
+
+
+async def _run_embeddings(request: EmbeddingsRequest, http_request: Request = None):
+    """Core embeddings logic; registered as a task by create_embeddings()."""
    model_info = await asyncio.to_thread(
        multi_model_manager.request_model, request.model, model_type="embedding")
    model_name = model_info.get('model_name')

--- a/codai/api/parler_worker.py
+++ b/codai/api/parler_worker.py
+# CoderAI - OpenAI-compatible API server
+# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+"""Fully-managed Parler-TTS worker.
+
+parler-tts pins an old transformers/tokenizers/huggingface-hub that conflict with
+the coderai server's stack, so it can't share this venv. Instead coderai owns the
+whole lifecycle here: on first use it bootstraps a dedicated venv (installing
+parler-tts), launches ``tools/parler_tts_service.py`` in it as a local HTTP
+service, health-checks it, and hands back the URL. The matching
+``_RemoteParlerBackend.cleanup()`` calls :func:`stop_service`, so the model
+manager's normal eviction tears the process down — no manual setup or config.
+"""
+
+import os
+import socket
+import subprocess
+import sys
+import threading
+import time
+from pathlib import Path
+
+_REPO_ROOT = Path(__file__).resolve().parents[2]
+_SERVICE_SCRIPT = _REPO_ROOT / "tools" / "parler_tts_service.py"
+
+# Dedicated venv for the (incompatible) parler-tts stack. Created with access to
+# the base interpreter's packages so torch/numpy aren't re-downloaded; parler's
+# pinned transformers installs into the venv and shadows the system one.
+_VENV_DIR = Path(os.environ.get("CODERAI_PARLER_VENV")
+                 or os.path.expanduser("~/.coderai/parler_venv"))
+
+_lock = threading.RLock()
+_services: dict[str, dict] = {}   # model_name -> {"proc","port","url"}
+_bootstrapped = False
+
+
+def _venv_python() -> Path:
+    return _VENV_DIR / ("Scripts" if os.name == "nt" else "bin") / (
+        "python.exe" if os.name == "nt" else "python")
+
+
+def _pip_ok(py: Path) -> bool:
+    try:
+        return subprocess.run([str(py), "-c", "import parler_tts, soundfile"],
+                              capture_output=True).returncode == 0
+    except Exception:
+        return False
+
+
+def _venv_is_system_site() -> bool:
+    """True if the venv was built with --system-site-packages (can't isolate)."""
+    try:
+        return "include-system-site-packages = true" in \
+            (_VENV_DIR / "pyvenv.cfg").read_text().lower()
+    except Exception:
+        return False
+
+
+def _bootstrap_venv() -> Path:
+    """Create a fully-isolated venv and install parler-tts (idempotent).
+
+    Isolation is the whole point: parler-tts pins an old transformers/tokenizers
+    that must NOT be shared with — or shadowed by — the server's stack, so the
+    venv gets its own copy of everything (torch included). Returns its python."""
+    global _bootstrapped
+    py = _venv_python()
+    if _bootstrapped and py.exists():
+        return py
+    # A previously-created shared-site venv leaks the server's transformers in;
+    # rebuild it isolated.
+    if py.exists() and _venv_is_system_site():
+        import shutil
+        print("[parler] rebuilding venv as fully isolated …", flush=True)
+        shutil.rmtree(_VENV_DIR, ignore_errors=True)
+    if not _venv_python().exists():
+        print(f"[parler] creating isolated venv at {_VENV_DIR} …", flush=True)
+        _VENV_DIR.parent.mkdir(parents=True, exist_ok=True)
+        subprocess.run([sys.executable, "-m", "venv", str(_VENV_DIR)], check=True)
+    py = _venv_python()
+    if not _pip_ok(py):
+        print("[parler] installing parler-tts + torch into the isolated venv "
+              "(first run, downloads several GB, this can take a while) …", flush=True)
+        subprocess.run([str(py), "-m", "pip", "install",
+                        "git+https://github.com/huggingface/parler-tts.git",
+                        "soundfile"], check=True)
+        if not _pip_ok(py):
+            raise RuntimeError("parler-tts install did not yield an importable package")
+    _bootstrapped = True
+    return py
+
+
+def _free_port() -> int:
+    # Bind to port 0 so the OS picks a free port; the context manager closes
+    # the socket even if bind() raises.
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+        s.bind(("127.0.0.1", 0))
+        return s.getsockname()[1]
+
+
+def _pump_logs(proc: subprocess.Popen, tail):
+    for line in proc.stdout:
+        line = line.rstrip()
+        if line:
+            tail.append(line)
+            print(f"[parler] {line}", flush=True)
+
+
+def _health_ok(url: str) -> bool:
+    import requests
+    try:
+        r = requests.get(url + "/health", timeout=3)
+        return r.ok and bool(r.json().get("ok"))
+    except Exception:
+        return False
+
+
+def ensure_service(model_name: str, ready_timeout: float = 1800.0) -> str:
+    """Start (or reuse) the worker for ``model_name`` and return its base URL.
+
+    First call bootstraps the venv and downloads the model, so the timeout is
+    generous. Raises RuntimeError if the service never comes up."""
+    with _lock:
+        svc = _services.get(model_name)
+        if svc and svc["proc"].poll() is None and _health_ok(svc["url"]):
+            return svc["url"]
+        if svc and svc["proc"].poll() is not None:
+            _services.pop(model_name, None)   # died — restart below
+
+        py = _bootstrap_venv()
+        port = _free_port()
+        url = f"http://127.0.0.1:{port}"
+        env = dict(os.environ)
+        # The worker must use the model already pulled via coderai's HF download
+        # interface — it never downloads anything itself. Point it at coderai's
+        # cache and force offline mode, so a missing model fails fast instead of
+        # silently fetching.
+        try:
+            from codai.models.cache import get_hf_hub_cache_dir
+            hub = get_hf_hub_cache_dir()
+            env["HF_HUB_CACHE"] = hub
+            env["HUGGINGFACE_HUB_CACHE"] = hub
+        except Exception:
+            pass
+        env["HF_HUB_OFFLINE"] = "1"
+        env["TRANSFORMERS_OFFLINE"] = "1"
+        proc = subprocess.Popen(
+            [str(py), str(_SERVICE_SCRIPT), "--model", model_name,
+             "--host", "127.0.0.1", "--port", str(port)],
+            stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
+            bufsize=1, env=env, cwd=str(_REPO_ROOT),
+        )
+        import collections
+        tail = collections.deque(maxlen=15)
+        threading.Thread(target=_pump_logs, args=(proc, tail), daemon=True).start()
+        _services[model_name] = {"proc": proc, "port": port, "url": url}
+
+    def _tail_msg():
+        joined = " | ".join(list(tail)[-5:]).strip()
+        if "offline" in joined.lower() or ("not" in joined.lower() and "found" in joined.lower()):
+            return (f". The model isn't in coderai's cache — download "
+                    f"'{model_name}' from the model interface first. ({joined})")
+        return f". Last output: {joined}" if joined else ""
+
+    # Wait (outside the lock) for the service to load the model and answer.
+    deadline = time.time() + ready_timeout
+    while time.time() < deadline:
+        if proc.poll() is not None:
+            raise RuntimeError(
+                f"Parler worker exited (code {proc.returncode}) before becoming ready"
+                + _tail_msg())
+        if _health_ok(url):
+            print(f"[parler] service ready for {model_name} at {url}", flush=True)
+            return url
+        time.sleep(2)
+    stop_service(model_name)
+    raise RuntimeError(f"Parler worker for {model_name} did not become ready in time"
+                       + _tail_msg())
+
+
+def stop_service(model_name: str) -> None:
+    with _lock:
+        svc = _services.pop(model_name, None)
+    if not svc:
+        return
+    proc = svc["proc"]
+    if proc.poll() is None:
+        try:
+            proc.terminate()
+            proc.wait(timeout=10)
+        except Exception:
+            pass
+    if proc.poll() is None:
+        try:
+            proc.kill()
+        except Exception:
+            pass
+    print(f"[parler] service for {model_name} stopped", flush=True)
+
+
+def stop_all() -> None:
+    for name in list(_services.keys()):
+        stop_service(name)
+
+
+import atexit as _atexit
+_atexit.register(stop_all)
--- a/codai/api/spatial.py
+++ b/codai/api/spatial.py
@@ -45,6 +45,31 @@ global_args = None
 global_file_path = None


+def _spatial_task(title: str):
+    """Decorator: register a spatial/3D endpoint in the unified task list so
+    every model type is visible there. Finishes done/error around the call."""
+    import functools
+
+    def deco(fn):
+        @functools.wraps(fn)
+        async def wrap(*args, **kwargs):
+            from codai.tasks import task_registry
+            tid = task_registry.register("spatial", title=title, model="spatial")
+            task_registry.start(tid)
+            try:
+                result = await fn(*args, **kwargs)
+                task_registry.finish(tid, "done")
+                return result
+            except HTTPException:
+                task_registry.finish(tid, "error")
+                raise
+            except Exception as e:
+                task_registry.finish(tid, "error", str(e)[:200])
+                raise
+        return wrap
+    return deco
+
+
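Reviewer sketch (not part of the commit): the register/start/finish pattern used by `_spatial_task` can be exercised on its own. This is a minimal illustration under stated assumptions: `FakeRegistry`, its `"queued"` initial status, and the `tracked` decorator are hypothetical stand-ins, since the real `codai.tasks.task_registry` is not shown in this diff. Only the `register` / `start` / `finish` call shapes are taken from the hunk above.

```python
import asyncio
import functools


class FakeRegistry:
    """Hypothetical in-memory stand-in for codai.tasks.task_registry."""

    def __init__(self):
        self.tasks = {}

    def register(self, kind, title="", model=""):
        tid = len(self.tasks) + 1
        # "queued" is an assumed initial status, used only for this sketch.
        self.tasks[tid] = {"kind": kind, "title": title, "status": "queued", "error": None}
        return tid

    def start(self, tid):
        self.tasks[tid]["status"] = "running"

    def finish(self, tid, status, error=None):
        self.tasks[tid].update(status=status, error=error)


registry = FakeRegistry()


def tracked(title):
    """Register the wrapped coroutine as a task and close it as done or error."""
    def deco(fn):
        @functools.wraps(fn)
        async def wrap(*args, **kwargs):
            tid = registry.register("spatial", title=title, model="spatial")
            registry.start(tid)
            try:
                result = await fn(*args, **kwargs)
            except Exception as exc:
                # Record a truncated message, then let the caller see the error.
                registry.finish(tid, "error", str(exc)[:200])
                raise
            registry.finish(tid, "done")
            return result
        return wrap
    return deco


@tracked("demo ok")
async def ok():
    return 42


@tracked("demo fail")
async def fail():
    raise ValueError("boom")


async def main():
    value = await ok()
    try:
        await fail()
    except ValueError:
        pass
    return value


value = asyncio.run(main())
statuses = [t["status"] for t in registry.tasks.values()]
print(value, statuses)
```

Because the wrapper re-raises after recording the failure, the endpoint's own error handling is unaffected; the task list only observes the outcome.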
 def set_global_args(args):
    global global_args
    global_args = args
@@ -500,6 +525,7 @@ class ImageTo3DRequest(BaseModel):


 @router.post("/v1/images/to3d", summary="Image to 3D model")
+@_spatial_task("Image → 3D")
 async def image_to_3d(request: ImageTo3DRequest, http_request: Request = None):
    """Convert a 2D image to a 3D representation.

@@ -568,6 +594,7 @@ class ImageFrom3DRequest(BaseModel):


 @router.post("/v1/images/from3d", summary="Render a 3D model to an image")
+@_spatial_task("3D → image")
 async def image_from_3d(request: ImageFrom3DRequest, http_request: Request = None):
    """Render a 3D model (GLB/OBJ) to a 2D PNG image from a specified camera angle."""
    raw = _decode_b64(request.model_data)
@@ -601,6 +628,7 @@ class VideoTo3DRequest(BaseModel):


 @router.post("/v1/video/to3d", summary="Video to 3D model")
+@_spatial_task("Video → 3D")
 async def video_to_3d(request: VideoTo3DRequest, http_request: Request = None):
    """Convert a 2D video to a 3D video frame-by-frame.

@@ -642,6 +670,7 @@ class VideoFrom3DRequest(BaseModel):


 @router.post("/v1/video/from3d", summary="Render a 3D model to a video")
+@_spatial_task("3D → video")
 async def video_from_3d(request: VideoFrom3DRequest, http_request: Request = None):
    """Render a 3D model as a 360° turntable video."""
    raw = _decode_b64(request.model_data)
@@ -675,6 +704,7 @@ class Generate3DRequest(BaseModel):


 @router.post("/v1/3d/generate", summary="Generate a 3D model from a prompt")
+@_spatial_task("Generate 3D")
 async def generate_3d(request: Generate3DRequest, http_request: Request = None):
    """Generate a 3D model (GLB) from a text prompt and/or an image.


--- a/codai/api/text.py
+++ b/codai/api/text.py
--- a/codai/api/transcriptions.py
+++ b/codai/api/transcriptions.py
@@ -135,6 +135,32 @@ async def create_transcription(
    if len(file_content) > _MAX_AUDIO_BYTES:
        raise HTTPException(status_code=413, detail="Audio file too large (max 100 MB)")

+    # Register a task so transcription appears in the unified task list, like
+    # every other model type. Finished on success or error below.
+    from codai.tasks import task_registry
+    _tid = task_registry.register(
+        "transcription",
+        title=(file.filename or "audio")[:80],
+        model=model or "",
+    )
+    task_registry.start(_tid)
+    try:
+        _resp = await _run_transcription(
+            file_content, model, language, prompt, response_format, temperature, file)
+        task_registry.finish(_tid, "done")
+        return _resp
+    except HTTPException:
+        task_registry.finish(_tid, "error")
+        raise
+    except Exception as e:
+        task_registry.finish(_tid, "error", str(e)[:200])
+        raise
+
+
+async def _run_transcription(
+    file_content: bytes, model: str, language, prompt, response_format, temperature, file
+):
+    """Core transcription logic; registered as a task by create_transcription()."""
    # Check if the requested model maps to a configured whisper-server instance first.
    # Try alias round-robin resolution before direct ID lookup.
    whisper_model_id = multi_model_manager.resolve_whisper_alias_model_id(model)

--- a/codai/api/tts.py
+++ b/codai/api/tts.py
--- a/codai/api/tts_backends.py
+++ b/codai/api/tts_backends.py
--- a/codai/backends/cuda.py
+++ b/codai/backends/cuda.py
--- a/codai/backends/ds4.py
+++ b/codai/backends/ds4.py
+# CoderAI - OpenAI-compatible API server
+# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+"""ds4 (DeepSeek V4) proxy backend.
+
+ds4-server already speaks the OpenAI HTTP API, so this backend is a thin proxy: it
+forwards chat/completion requests to the managed ``ds4-server`` subprocess (whose
+lifecycle is owned by :mod:`codai.api.ds4_worker`) and adapts the responses to the
+:class:`~codai.backends.base.ModelBackend` contract the model manager expects.
+
+Tool/think parsing is handled the same way as the other backends — by
+``ModelParserAdapter`` over the returned text — so tools are not forwarded to
+ds4-server; the text-level ``DeepSeekParser`` extracts ``<think>`` and tool calls.
+"""
+
+import asyncio
+import threading
+from typing import AsyncGenerator, Dict, List, Optional
+
+from codai.backends.base import ModelBackend
+
+
+class Ds4Backend(ModelBackend):
+    """Proxy backend that routes generation to a managed ds4-server."""
+
+    def __init__(self, cfg=None):
+        # cfg is a codai.config.Ds4Config. When omitted, resolve the active one.
+        if cfg is None:
+            from codai.config import Ds4Config
+            cfg = Ds4Config()
+        self._cfg = cfg
+        self._model_id = getattr(cfg, "model_id", "deepseek-v4") or "deepseek-v4"
+        self._url: Optional[str] = None
+        self._ctx = int(getattr(cfg, "ctx", 100000) or 100000)
+        self._last_usage: Dict = {}
+
+    # ------------------------------------------------------------------ #
+    # lifecycle
+    # ------------------------------------------------------------------ #
+    def load_model(self, model_name: str, **kwargs) -> None:
+        from codai.api import ds4_worker
+        if model_name:
+            self._model_id = model_name
+        self._url = ds4_worker.ensure_service(self._cfg)
+
+    def get_model_name(self) -> str:
+        return self._model_id
+
+    def get_context_size(self) -> int:
+        return self._ctx
+
+    def get_last_usage(self) -> dict:
+        return dict(self._last_usage)
+
+    def cleanup(self) -> None:
+        from codai.api import ds4_worker
+        ds4_worker.stop_service(getattr(self._cfg, "model_id", self._model_id))
+        self._url = None
+
+    # ------------------------------------------------------------------ #
+    # helpers
+    # ------------------------------------------------------------------ #
+    def _base(self) -> str:
+        if not self._url:
+            raise RuntimeError("ds4 service not started")
+        return self._url
+
+    def _store_usage(self, usage: dict) -> None:
+        if usage:
+            self._last_usage = {
+                "prompt_tokens": usage.get("prompt_tokens", 0),
+                "completion_tokens": usage.get("completion_tokens", 0),
+                "total_tokens": usage.get("total_tokens", 0),
+            }
+
+    def format_messages(self, messages) -> str:
+        # ds4-server applies DeepSeek V4's own chat template server-side; this is only
+        # used by callers that need a flat prompt string.
+        parts = []
+        for m in messages:
+            role = m.get("role") if isinstance(m, dict) else getattr(m, "role", "")
+            content = m.get("content") if isinstance(m, dict) else getattr(m, "content", "")
+            parts.append(f"{role}: {content}")
+        return "\n".join(parts)
+
+    def _chat_payload(self, messages, max_tokens, temperature, top_p, stop, stream):
+        payload = {
+            "model": self._model_id,
+            "messages": messages,
+            "temperature": temperature,
+            "top_p": top_p,
+            "stream": stream,
+        }
+        if max_tokens:
+            payload["max_tokens"] = max_tokens
+        if stop:
+            payload["stop"] = stop
+        return payload
+
+    # ------------------------------------------------------------------ #
+    # chat-level generation (preferred by the manager)
+    # ------------------------------------------------------------------ #
+    def generate_chat(self, messages: List[Dict], max_tokens=None, temperature=0.7,
+                      top_p=1.0, stop=None, tools=None, response_format=None):
+        import requests
+        payload = self._chat_payload(messages, max_tokens, temperature, top_p, stop, False)
+        if response_format and response_format.get("type") == "json_object":
+            payload["response_format"] = {"type": "json_object"}
+        r = requests.post(self._base() + "/v1/chat/completions", json=payload, timeout=3600)
+        r.raise_for_status()
+        data = r.json()
+        self._store_usage(data.get("usage", {}))
+        return data["choices"][0]["message"].get("content") or ""
+
+    async def generate_chat_stream(self, messages: List[Dict], max_tokens=None,
+                                   temperature=0.7, top_p=1.0, stop=None, tools=None,
+                                   response_format=None) -> AsyncGenerator[str, None]:
+        payload = self._chat_payload(messages, max_tokens, temperature, top_p, stop, True)
+        async for chunk in self._stream(self._base() + "/v1/chat/completions", payload,
+                                        delta_key="delta"):
+            yield chunk
+
+    # ------------------------------------------------------------------ #
+    # plain completion (fallback path)
+    # ------------------------------------------------------------------ #
+    def generate(self, prompt: str, max_tokens=None, temperature: float = 0.7,
+                 top_p: float = 1.0, stop=None, repeat_penalty: float = 1.0,
+                 presence_penalty: float = 0.0, frequency_penalty: float = 0.0) -> str:
+        return self.generate_chat([{"role": "user", "content": prompt}],
+                                  max_tokens, temperature, top_p, stop)
+
+    async def generate_stream(self, prompt: str, max_tokens=None, temperature: float = 0.7,
+                              top_p: float = 1.0, stop=None, repeat_penalty: float = 1.0,
+                              presence_penalty: float = 0.0,
+                              frequency_penalty: float = 0.0) -> AsyncGenerator[str, None]:
+        async for chunk in self.generate_chat_stream(
+                [{"role": "user", "content": prompt}], max_tokens, temperature, top_p, stop):
+            yield chunk
+
+    # ------------------------------------------------------------------ #
+    # SSE streaming: iterate the blocking requests stream on a worker thread
+    # and hand chunks to the event loop through an asyncio.Queue.
+    # ------------------------------------------------------------------ #
+    async def _stream(self, url: str, payload: dict, delta_key: str
+                      ) -> AsyncGenerator[str, None]:
+        import json
+        loop = asyncio.get_running_loop()
+        queue: asyncio.Queue = asyncio.Queue()
+        _SENTINEL = object()
+
+        def _worker():
+            import requests
+            try:
+                with requests.post(url, json=payload, stream=True, timeout=3600) as r:
+                    r.raise_for_status()
+                    for raw in r.iter_lines(decode_unicode=True):
+                        if not raw or not raw.startswith("data:"):
+                            continue
+                        data = raw[len("data:"):].strip()
+                        if data == "[DONE]":
+                            break
+                        try:
+                            obj = json.loads(data)
+                        except ValueError:
+                            continue
+                        choice = (obj.get("choices") or [{}])[0]
+                        text = (choice.get(delta_key) or {}).get("content") or ""
+                        if text:
+                            loop.call_soon_threadsafe(queue.put_nowait, text)
+                        if obj.get("usage"):
+                            self._store_usage(obj["usage"])
+                        if choice.get("finish_reason"):
+                            break
+            except Exception as exc:  # surface to the consumer
+                loop.call_soon_threadsafe(queue.put_nowait, exc)
+            finally:
+                loop.call_soon_threadsafe(queue.put_nowait, _SENTINEL)
+
+        threading.Thread(target=_worker, daemon=True).start()
+        while True:
+            item = await queue.get()
+            if item is _SENTINEL:
+                break
+            if isinstance(item, Exception):
+                raise item
+            yield item
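Reviewer sketch (not part of the commit): the thread-to-event-loop handoff described in the `_stream` comment can be shown without any HTTP. This is a minimal illustration, assuming a plain list iterator stands in for the blocking `requests` line stream; `bridge`, `collect` and `_DONE` are hypothetical names.

```python
import asyncio
import threading
from typing import AsyncGenerator, Callable, Iterable

_DONE = object()  # sentinel marking the end of the stream


async def bridge(make_iter: Callable[[], Iterable[str]]) -> AsyncGenerator[str, None]:
    """Yield items produced by a blocking iterator that runs on a worker thread."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def worker() -> None:
        try:
            for item in make_iter():
                # asyncio.Queue is not thread-safe; schedule the put on the loop.
                loop.call_soon_threadsafe(queue.put_nowait, item)
        except Exception as exc:  # forward the failure to the consumer
            loop.call_soon_threadsafe(queue.put_nowait, exc)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, _DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = await queue.get()
        if item is _DONE:
            break
        if isinstance(item, Exception):
            raise item
        yield item


async def collect() -> list:
    # A plain list iterator stands in for the blocking HTTP line stream.
    return [chunk async for chunk in bridge(lambda: iter(["Hel", "lo"]))]


chunks = asyncio.run(collect())
print(chunks)
```

The sentinel is queued in `finally`, so the consumer loop terminates whether the worker finishes normally or fails.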
--- a/codai/backends/vulkan.py
+++ b/codai/backends/vulkan.py
--- a/codai/broker/capabilities.py
+++ b/codai/broker/capabilities.py
@@ -49,7 +49,13 @@ def build_hardware_summary() -> Dict[str, Any]:
    total_vram_mb = 0
    available_vram_mb = 0

+    # Only use torch if it's ALREADY loaded (i.e. we're in an engine). Never import
+    # it here — the front is torch-free and must stay that way (importing torch in
+    # the front is heavy and would initialise CUDA in the wrong process).
+    import sys as _sys
    try:
+        if "torch" not in _sys.modules:
+            raise ImportError("torch not loaded (front) — using torch-free path")
        import torch

        if torch.cuda.is_available():
@@ -76,6 +82,23 @@ def build_hardware_summary() -> Dict[str, Any]:
    except Exception:
        pass

+    # Torch-free path (e.g. the front, which imports no torch): enumerate every
+    # physical card via nvidia-smi + sysfs so VRAM is reported for the whole node.
+    if not gpus:
+        try:
+            from codai.frontproxy.gpu_detect import gpu_stats
+            for c in gpu_stats():
+                total_mb = int(round((c.get("mem_total") or 0) * 1024))
+                used_mb = int(round((c.get("mem_used") or 0) * 1024))
+                if total_mb <= 0:
+                    continue
+                gpus.append({"name": c.get("name") or c.get("vendor"),
+                             "total_vram_mb": total_mb})
+                total_vram_mb += total_mb
+                available_vram_mb += max(0, total_mb - used_mb)
+        except Exception:
+            pass
+
    if not gpus:
        for total_path in sorted(glob.glob("/sys/class/drm/card*/device/mem_info_vram_total")):
            used_path = total_path.replace("vram_total", "vram_used")
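Reviewer sketch (not part of the commit): the "use torch only if it is already loaded" guard amounts to a lookup in `sys.modules`. A minimal illustration, assuming a hypothetical helper named `already_loaded`; the commit itself inlines the check and raises `ImportError` to fall through to the torch-free path.

```python
import sys
from types import ModuleType
from typing import Optional


def already_loaded(name: str) -> Optional[ModuleType]:
    """Return the module only if something else imported it; never import it here."""
    return sys.modules.get(name)


torch = already_loaded("torch")
if torch is None:
    print("torch not loaded, using the torch-free path")
else:
    print("torch already loaded, CUDA available:", torch.cuda.is_available())
```

Looking the module up instead of importing it keeps a process that never loaded torch from paying the import cost or initialising CUDA.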

--- a/codai/broker/dispatcher.py
+++ b/codai/broker/dispatcher.py
@@ -60,8 +60,13 @@ def _is_text_response(content_type: str | None) -> bool:
    )


-async def execute_broker_request(app, envelope):
-    """Validate and execute a broker request envelope."""
+async def execute_broker_request(app, envelope, executor=None):
+    """Validate and execute a broker request envelope.
+
+    ``executor`` is an ``async (method, path, headers, query, body) -> {status_code,
+    headers, body}`` callable. When omitted the request is run in-process against
+    ``app`` via the ASGI bridge (engine / single-process mode). The front passes its
+    own executor that proxies to the right engine over HTTP."""

    logger.debug(
        "broker dispatch → op=%s request_id=%s path=%r method=%r stream=%s",
@@ -136,14 +141,20 @@ async def execute_broker_request(app, envelope):
        headers["content-type"] = envelope.content_type

    started_at = perf_counter()
-    response = await execute_internal_request(
-        app,
-        method=envelope.method,
-        path=envelope.path,
-        headers=headers,
-        query=envelope.query,
-        body=body,
-    )
+    if executor is not None:
+        response = await executor(
+            method=envelope.method, path=envelope.path, headers=headers,
+            query=envelope.query, body=body,
+        )
+    else:
+        response = await execute_internal_request(
+            app,
+            method=envelope.method,
+            path=envelope.path,
+            headers=headers,
+            query=envelope.query,
+            body=body,
+        )
    elapsed_ms = round((perf_counter() - started_at) * 1000, 3)

    response_headers = response["headers"]

--- a/codai/cli.py
+++ b/codai/cli.py
@@ -224,6 +224,13 @@ configuration directory (--config DIR, default: OS-specific CoderAI directory).
        action="store_true",
        help="Dump model output: raw output, parsed output, and litellm debug info",
    )
+    parser.add_argument(
+        "--debug-requests",
+        action="store_true",
+        help="Log the full request/response payloads exchanged with API clients "
+             "(opencode, etc.): incoming messages + tools and the outgoing "
+             "content/tool_calls. Use to diagnose agentic tool-call loops.",
+    )
    parser.add_argument(
        "--list-cached-models",
        action="store_true",
@@ -278,4 +285,39 @@ configuration directory (--config DIR, default: OS-specific CoderAI directory).
        help="Ignore any existing pipeline cache and rebuild it from scratch this "
             "run (use after changing a model's quantization/precision config).",
    )
+    # ─── Frontend/engine split ───────────────────────────────────────────────
+    parser.add_argument(
+        "--single-process",
+        action="store_true",
+        help="Run the legacy single-process server (UI/API and all model work in "
+             "one process). Default boots a front proxy + supervised engine "
+             "subprocess(es) so the web UI stays responsive during model work.",
+    )
+    parser.add_argument(
+        "--engine-only",
+        action="store_true",
+        help="Run this process as an engine (binds an internal localhost port, no "
+             "front proxy). Normally launched automatically by the front; not "
+             "intended to be run by hand.",
+    )
+    parser.add_argument(
+        "--internal-port",
+        type=int,
+        default=None,
+        help="Internal port for --engine-only mode (the front assigns one per engine).",
+    )
+    parser.add_argument(
+        "--debug-engine",
+        action="store_true",
+        help="General engine debugging in the front/engine split (engine lifecycle, "
+             "spawn details, health transitions). Does NOT include the internal "
+             "HTTP access log — use --debug-engine-web for that.",
+    )
+    parser.add_argument(
+        "--debug-engine-web",
+        action="store_true",
+        help="Show the internal front↔engine HTTP requests in an engine's access log "
+             "(proxied calls, /internal/engine-state, /healthz, …). Suppressed by "
+             "default since every engine only ever serves internal front traffic.",
+    )
    return parser.parse_args()
--- a/codai/config.py
+++ b/codai/config.py
--- a/codai/frontproxy/__init__.py
+++ b/codai/frontproxy/__init__.py
+# CoderAI - OpenAI-compatible API server
+# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+"""Front proxy package: always-responsive web/API front + supervised engines.
+
+See ``docs/frontend-engine-split.md`` and ``docs/process-isolation-plans.md``.
+"""
+
+from codai.frontproxy.app import run_front, build_app
+
+__all__ = ["run_front", "build_app"]
--- a/codai/frontproxy/app.py
+++ b/codai/frontproxy/app.py
--- a/codai/frontproxy/assignment.py
+++ b/codai/frontproxy/assignment.py
--- a/codai/frontproxy/engine_supervisor.py
+++ b/codai/frontproxy/engine_supervisor.py
--- a/codai/frontproxy/gpu_detect.py
+++ b/codai/frontproxy/gpu_detect.py
--- a/codai/frontproxy/registry.py
+++ b/codai/frontproxy/registry.py
--- a/codai/frontproxy/router.py
+++ b/codai/frontproxy/router.py
--- a/codai/main.py
+++ b/codai/main.py
--- a/codai/models/capabilities.py
+++ b/codai/models/capabilities.py
@@ -21,6 +21,7 @@ from threading import Lock
 from typing import List, Optional
 import json
 import os
+import re
 import time


@@ -179,11 +180,15 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities:
        return caps

    # ── Image: upscaling (checked before general SD rule to catch SD-family upscalers) ──
-    if any(x in n for x in ['real-esrgan', 'esrgan', 'swinir', 'edsr',
-                              'bsrgan', 'hat-', 'dat-',
+    # 'hat-'/'dat-' are short, ambiguous tokens (e.g. 'hat-' appears inside
+    # "chat-"); require a word boundary before them so a text "chat"
+    # model isn't mistaken for the HAT/DAT super-resolution checkpoints.
+    if (any(x in n for x in ['real-esrgan', 'esrgan', 'swinir', 'edsr',
+                              'bsrgan',
                              'x2-upscaler', 'x4-upscaler', 'x2_upscaler', 'x4_upscaler',
                              'latent-upscaler', 'latent_upscaler',
-                              'ldm-super-resolution', 'rcan-', 'sr3-']):
+                              'ldm-super-resolution', 'rcan-', 'sr3-'])
+            or re.search(r'\b[hd]at-', n)):
        caps.image_upscaling = True
        caps.image_to_image = True
        return caps
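Reviewer sketch (not part of the commit): the word-boundary guard can be checked in isolation. A minimal illustration, assuming the model name has already been lowercased into `n` as in the hunk; `looks_like_hat_dat` and the sample names are hypothetical.

```python
import re

# Same pattern as the hunk: 'hat-' or 'dat-' only when preceded by a word boundary.
_HAT_DAT = re.compile(r'\b[hd]at-')


def looks_like_hat_dat(n: str) -> bool:
    """Return True when the lowercased name contains a standalone hat-/dat- token."""
    return _HAT_DAT.search(n) is not None


samples = ["org/hat-l-x4", "dat-light-x2", "org/some-chat-7b"]
for name in samples:
    print(name, looks_like_hat_dat(name))
```

Note that `\b` treats `_` as a word character, so a name such as `x_hat-l` would not match under this pattern.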

--- a/codai/models/manager.py
+++ b/codai/models/manager.py
--- a/codai/models/parser.py
+++ b/codai/models/parser.py
--- a/codai/models/quant.py
+++ b/codai/models/quant.py
--- a/codai/models/ram_monitor.py
+++ b/codai/models/ram_monitor.py
--- a/codai/models/thermal.py
+++ b/codai/models/thermal.py
--- a/codai/models/tmp_janitor.py
+++ b/codai/models/tmp_janitor.py
--- a/codai/tasks/registry.py
+++ b/codai/tasks/registry.py
--- a/commands
+++ b/commands
+python tools/video_editor.py --no-browser --host 0.0.0.0 --media-dir tools/coderai_media --session
+tools/gen_township_fighters.py -c township_output/township_config.json
+
--- a/docs/deepseek-ds4.md
+++ b/docs/deepseek-ds4.md
--- a/docs/expressive-tts.md
+++ b/docs/expressive-tts.md
--- a/docs/frontend-engine-split.md
+++ b/docs/frontend-engine-split.md
--- a/docs/process-isolation-plans.md
+++ b/docs/process-isolation-plans.md
--- a/docs/reverse-proxy-nginx.md
+++ b/docs/reverse-proxy-nginx.md
--- a/packaging/linux/Dockerfile.oci
+++ b/packaging/linux/Dockerfile.oci
--- a/packaging/linux/Dockerfile.oci-venv
+++ b/packaging/linux/Dockerfile.oci-venv
--- a/packaging/linux/Dockerfile.update
+++ b/packaging/linux/Dockerfile.update
--- a/packaging/linux/README-RUN.txt
+++ b/packaging/linux/README-RUN.txt
--- a/packaging/linux/build_oci_image.sh
+++ b/packaging/linux/build_oci_image.sh
--- a/packaging/linux/launcher/coderai-entrypoint
+++ b/packaging/linux/launcher/coderai-entrypoint
--- a/packaging/linux/launcher/coderai-oci
+++ b/packaging/linux/launcher/coderai-oci
--- a/packaging/linux/launcher/sadtalker
+++ b/packaging/linux/launcher/sadtalker
--- a/packaging/linux/launcher/wav2lip
+++ b/packaging/linux/launcher/wav2lip
--- a/packaging/linux/launcher/with-env
+++ b/packaging/linux/launcher/with-env
--- a/packaging/linux/nginx.conf
+++ b/packaging/linux/nginx.conf
--- a/packaging/linux/run_oci.sh
+++ b/packaging/linux/run_oci.sh
--- a/packaging/linux/smoke_test_services.sh
+++ b/packaging/linux/smoke_test_services.sh
--- a/packaging/linux/supervisord.conf
+++ b/packaging/linux/supervisord.conf
--- a/packaging/linux/update_oci_image.sh
+++ b/packaging/linux/update_oci_image.sh
--- a/requirements-nvidia.txt
+++ b/requirements-nvidia.txt
--- a/tools/gen_township_fighters.py
+++ b/tools/gen_township_fighters.py
--- a/tools/parler_tts_service.py
+++ b/tools/parler_tts_service.py
--- a/tools/video_editor.py
+++ b/tools/video_editor.py
--- a/tools/videogen.py
+++ b/tools/videogen.py
--- a/video_editor.config.json
+++ b/video_editor.config.json