front/engine split, ds4 + media tooling, gemma-4 native tools; ignore runtime artifacts

- frontproxy: torch-free front proxy + per-vendor engine supervisor with auth, localhost binding, model routing; Ctrl-C now force-kills engines (own session + PDEATHSIG, SIGKILL of engine process groups, watchdog on hung drain)
- gemma-4 tool calling: prompt via native tools= template, parse call:NAME{...} into tool_calls, honour generation_config EOS so it stops instead of looping
- ds4 external worker, parler/expressive TTS backends, video editor tooling
- --debug-requests: full client<->API request/response logging + live snapshots
- stop tracking runtime artifacts (video_editor/sessions/, tools/coderai_media/)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

b297b25f · Stefy Lanza (nextime / spora) · 2fb085f4
Commit b297b25f authored Jun 18, 2026 by Stefy Lanza (nextime / spora)
46 changed files
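
The commit message says Ctrl-C now force-kills engines by giving each one its own session and sending SIGKILL to the engine process group. A minimal Python sketch of that pattern follows, assuming a POSIX host. The helper names `spawn_engine` and `force_kill` are hypothetical and not taken from the commit; the PDEATHSIG step (Linux-only) and the drain watchdog are left out.

```python
import os
import signal
import subprocess
import sys


def spawn_engine(argv):
    """Start a child in its own session, so it also leads its own process group."""
    # start_new_session=True makes the child call setsid() before exec.
    return subprocess.Popen(argv, start_new_session=True)


def force_kill(proc, timeout=5.0):
    """SIGKILL the child's whole process group, then reap the child."""
    try:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group is already gone
    return proc.wait(timeout=timeout)


if __name__ == "__main__":
    # Stand-in for an engine: a Python child that just sleeps.
    child = spawn_engine([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit status:", force_kill(child))
```

Killing the group rather than the single PID also takes down any helpers the engine forked, which is presumably why the commit targets process groups.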
--- a/.gitignore
+++ b/.gitignore
@@ -33,3 +33,7 @@ township_output/
 # Packaging build cache + runtime temp (large artifacts)
 .packaging-cache/
 tmp/
+
+# Video editor sessions + generated media (runtime artifacts)
+video_editor/sessions/
+tools/coderai_media/
--- a/build.sh
+++ b/build.sh
@@ -35,6 +35,7 @@ BACKEND="${1:-all}"
 FLASH=false
 CUSTOM_VENV=""
 PACKAGE=false
+DS4=false

 # Parse arguments
 i=1
@@ -50,6 +51,9 @@ for arg in "$@"; do
        --package)
            PACKAGE=true
            ;;
+        --ds4)
+            DS4=true
+            ;;
    esac
    i=$((i + 1))
 done
@@ -68,6 +72,7 @@ if [[ "$BACKEND" != "nvidia" && "$BACKEND" != "vulkan" && "$BACKEND" != "vulkan-
    echo ""
    echo "Options:"
    echo "  --flash     - Install Flash Attention 2 for faster inference (NVIDIA only)"
+    echo "  --ds4       - Clone + build the ds4 (DeepSeek V4) native engine"
    exit 1
 fi

@@ -755,6 +760,35 @@ package_app() {
    echo -e "${YELLOW}Note: The target machine must still provide compatible system GPU/runtime libraries.${NC}"
 }

+# Optionally clone + build ds4 (DeepSeek V4 native engine). Opt-in via --ds4.
+# coderai can also auto-build this at runtime on first use, but doing it here lets
+# the OCI/Docker packaging bundle the prebuilt ds4-server binary.
+build_ds4() {
+    local DS4_DIR="${CODERAI_DS4_DIR:-$HOME/.coderai/ds4}"
+    echo -e "${YELLOW}Building ds4 (DeepSeek V4 engine) → $DS4_DIR ...${NC}"
+    if [ ! -e "$DS4_DIR/Makefile" ]; then
+        mkdir -p "$(dirname "$DS4_DIR")"
+        git clone --depth 1 https://github.com/antirez/ds4 "$DS4_DIR" || {
+            echo -e "${YELLOW}Warning: could not clone ds4; skipping.${NC}"; return 0; }
+    fi
+    local TARGET="cpu"
+    if command -v nvcc &> /dev/null || [ -d "/usr/local/cuda" ]; then
+        TARGET="cuda-generic"
+    elif [ "$(uname -s)" = "Darwin" ]; then
+        TARGET=""   # bare `make` builds the macOS Metal backend
+    fi
+    ( cd "$DS4_DIR" && make $TARGET ) || {
+        echo -e "${YELLOW}Warning: ds4 build failed; it can still be built at runtime.${NC}"; return 0; }
+    if [ -x "$DS4_DIR/ds4-server" ]; then
+        echo -e "${GREEN}✓ ds4-server built at $DS4_DIR/ds4-server${NC}"
+        echo -e "${YELLOW}Note: DeepSeek V4 weights are downloaded on first use (multi-GB).${NC}"
+    fi
+}
+
+if [ "$DS4" = true ]; then
+    build_ds4
+fi
+
 # Create .backend file to track which backend was used
 echo "$BACKEND" > .backend
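
The target selection inside `build_ds4` above can be restated as a small pure function, which makes the precedence easier to see. This is only a sketch in Python of the same three branches; the function names `pick_ds4_target` and `detect_ds4_target` are hypothetical.

```python
import platform
import shutil
from pathlib import Path


def pick_ds4_target(has_nvcc, has_cuda_dir, system):
    """Return the make target build_ds4 would pass, or "" for a bare `make`."""
    if has_nvcc or has_cuda_dir:
        return "cuda-generic"
    if system == "Darwin":
        return ""  # bare `make` builds the macOS Metal backend
    return "cpu"


def detect_ds4_target():
    """Probe the current machine with the same checks as the shell function."""
    return pick_ds4_target(
        has_nvcc=shutil.which("nvcc") is not None,
        has_cuda_dir=Path("/usr/local/cuda").is_dir(),
        system=platform.system(),
    )
```

Because the CUDA check runs first, a macOS host that happens to have `/usr/local/cuda` would get `cuda-generic` rather than the Metal build.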


--- a/codai/admin/routes.py
+++ b/codai/admin/routes.py
--- a/codai/admin/templates/chat.html
+++ b/codai/admin/templates/chat.html
@@ -2372,7 +2372,7 @@ const STUDIO_CAPABILITIES = {
    optional:[],
    notes:[
      'Requires <code>insightface</code> and <code>onnxruntime</code>: <code>pip install insightface onnxruntime</code>.',
-      'The <b>inswapper_128.onnx</b> model is <b>auto-downloaded</b> from HuggingFace on first use (<a href="/admin/models?tab=search&q=inswapper&pipeline=&gguf=no-gguf" class="cap-find-link">deepinsight/inswapper<span class="cap-find-icon">↗</span></a>).',
+      'The <b>inswapper_128.onnx</b> model is <b>auto-downloaded</b> from HuggingFace on first use (<a href="' + (window.ROOT_PATH||'') + '/admin/models?tab=search&q=inswapper&pipeline=&gguf=no-gguf" class="cap-find-link">deepinsight/inswapper<span class="cap-find-icon">↗</span></a>).',
      'No AI model selection needed — this feature uses its own dedicated backend.',
    ],
    backendPath: ROOT_PATH + '/v1/images/faceswap',
@@ -2386,7 +2386,7 @@ const STUDIO_CAPABILITIES = {
    optional:[],
    notes:[
      'Requires <code>insightface</code> and <code>onnxruntime</code>: <code>pip install insightface onnxruntime</code>.',
-      'The <b>inswapper_128.onnx</b> model is <b>auto-downloaded</b> from HuggingFace on first use (<a href="/admin/models?tab=search&q=inswapper&pipeline=&gguf=no-gguf" class="cap-find-link">deepinsight/inswapper<span class="cap-find-icon">↗</span></a>).',
+      'The <b>inswapper_128.onnx</b> model is <b>auto-downloaded</b> from HuggingFace on first use (<a href="' + (window.ROOT_PATH||'') + '/admin/models?tab=search&q=inswapper&pipeline=&gguf=no-gguf" class="cap-find-link">deepinsight/inswapper<span class="cap-find-icon">↗</span></a>).',
      'No AI model selection needed — this feature uses its own dedicated backend.',
    ],
    backendPath: ROOT_PATH + '/v1/images/faceswap',
@@ -2461,14 +2461,14 @@ function capSearchUrl(cap) {
  const s = CAP_TO_HF_SEARCH[cap];
  if (!s) return null;
  const p = new URLSearchParams({ tab:'search', q: s.q, pipeline: s.pipeline, gguf: s.gguf });
-  return '/admin/models?' + p.toString();
+  return (window.ROOT_PATH || '') + '/admin/models?' + p.toString();
 }
 function capMissingHtml(caps, label) {
  if (!caps.length) return '';
  const links = caps.map(cap => {
    const chip = `<span class="cap-chip dim">${cap.replace(/_/g,' ')}</span>`;
    if (_localCapSet.has(cap)) {
-      const url = `/admin/models?local_cap=${encodeURIComponent(cap)}`;
+      const url = `${window.ROOT_PATH || ''}/admin/models?local_cap=${encodeURIComponent(cap)}`;
      return `<a href="${url}" class="cap-find-link" title="You have a local model with ${cap.replace(/_/g,' ')} — click to configure it">${chip}<span class="cap-find-icon" style="color:#6ecf7e">↑ configure</span></a>`;
    }
    const url = capSearchUrl(cap);

--- a/codai/admin/templates/models.html
+++ b/codai/admin/templates/models.html
@@ -577,6 +577,13 @@ window.__DEFAULT_WHISPER_SERVER_PATH__ = {{ default_whisper_server_path|tojson }
          </select>
        </div>
      </div>
+      <div class="form-row" id="cfg-engine-row" style="margin-top:.75rem;display:none">
+        <label class="form-label">Engine / card</label>
+        <select id="cfg-engine" class="form-input">
+          <option value="">Default (auto — by capability)</option>
+        </select>
+        <span class="form-hint" style="font-size:11px">Pin this model to a specific engine/card. Overrides the default engine. Only shown when multiple engines are running.</span>
+      </div>
      <div style="display:grid;grid-template-columns:1fr 1fr;gap:.75rem;margin-top:.75rem">
        <div class="form-row" style="margin:0">
          <label class="form-label">Used VRAM <span class="muted">(GB)</span></label>
@@ -1441,8 +1448,7 @@ function handleProgressEvent(evt){
    showDownloadError(evt.message);
  }else if(evt.type==='cancelled'){
    _dlDone=true;
-    if(_dlEs){_dlEs.close();_dlEs=null;}
-    showDownloadError('Download cancelled');
+    showDownloadCancelled();
  }
  // keepalive: ignore
 }
@@ -1483,7 +1489,7 @@ async function reopenDownload(session_id){
        if(s.rate) document.getElementById('dl-speed').textContent=fmtRate(s.rate);
        if(s.eta!=null) document.getElementById('dl-eta').textContent=fmtEta(s.eta);
        if(s.status==='done'){handleProgressEvent({type:'done'});return;}
-        if(s.status==='cancelled'){showDownloadError('Download cancelled');return;}
+        if(s.status==='cancelled'){_dlDone=true;showDownloadCancelled();return;}
        if(s.status==='error'){showDownloadError(s.error||'Download failed');return;}
      }
    }
@@ -1501,15 +1507,27 @@ async function reopenDownload(session_id){
  };
 }

+function showDownloadCancelled(){
+  if(_dlEs){_dlEs.close();_dlEs=null}
+  document.getElementById('dl-form').style.display='block';
+  document.getElementById('dl-progress').style.display='none';
+}
+
 async function stopDownload(session_id){
  if(!confirm('Cancel this download?')) return;
  try{
-    await fetch(ROOT_PATH + '/admin/api/download-cancel/'+session_id, {method:'POST'});
+    const r = await fetch(ROOT_PATH + '/admin/api/download-cancel/'+session_id, {method:'POST'});
+    if(!r.ok){
+      let detail = r.status+' '+r.statusText;
+      try{ const j = await r.json(); if(j&&j.detail) detail = j.detail; }catch{}
+      alert('Could not cancel download: '+detail);
+      return;
+    }
    if(_dlSessionId===session_id){
-      if(_dlEs){_dlEs.close();_dlEs=null;}
      _dlDone=true;
-      showDownloadError('Download cancelled');
+      showDownloadCancelled();
    }
+    pollDownloads(); // refresh the active-downloads strip immediately
  }catch(e){
    alert('Could not cancel download: '+e.message);
  }
@@ -1798,12 +1816,53 @@ let _localModels = [];
 let _ggufFiles = [];
 let _hfModels = [];

+// Engine/card hardware info (fetched once); used to tag models with the card they
+// run on when more than one engine is configured.
+let _engineNames = [];
+let _defaultEngine = '';
+async function _loadEngineInfo(){
+  // Live engine names from the front (covers auto-detected engines, not just those
+  // declared in engine_specs); default_engine still comes from settings.
+  try {
+    const er = await fetch(ROOT_PATH + '/admin/api/engines');
+    if (er.ok) _engineNames = ((await er.json()).engines || []).map(e => e.name);
+  } catch(e) {}
+  try {
+    const d = await (await fetch(ROOT_PATH + '/admin/api/settings')).json();
+    if (!_engineNames.length) _engineNames = (d.server && d.server.engine_names) || [];
+    _defaultEngine = (d.server && d.server.default_engine) || '';
+  } catch(e) {}
+}
+// Compact card tag for a model config. Pinned engines show as-is (with 📌);
+// otherwise the engine is inferred from the model's format (transformers/ds4 →
+// nvidia; gguf/whisper → the default engine, or "any"). Hidden when ≤1 engine, so
+// it never widens single-card setups.
+function _engineTagHtml(m, s){
+  if(!_engineNames || _engineNames.length < 2) return '';
+  let eng = ((s && s.engine) || '').trim();
+  let pinned = !!eng;
+  if(!eng){
+    const path = (((m && (m.path || m.id || m.filename)) || '') + '').toLowerCase();
+    const isGguf = path.endsWith('.gguf') || path.includes('gguf');
+    const isWhisper = ((s && s.backend) || '') === 'whisper-server';
+    const isDs4 = path.includes('deepseek-v4');
+    if(isDs4 || (!isGguf && !isWhisper)) eng = 'nvidia';   // ds4/transformers → nvidia
+    else eng = _defaultEngine || 'any';                     // gguf/whisper → default
+  }
+  const lc = eng.toLowerCase();
+  const color = (lc.includes('nv')) ? '#76b900'
+              : (lc.includes('rad') || lc.includes('amd')) ? '#ed1c24'
+              : 'var(--text-3)';
+  const title = pinned ? ('Pinned to engine: ' + eng) : ('Runs on: ' + eng + ' (auto)');
+  return `<span class="badge" title="${esc(title)}" style="font-size:9px;padding:.05rem .3rem;margin:.1rem .1rem 0 0;vertical-align:middle;border:1px solid ${color};color:${color};background:transparent">${esc(eng)}${pinned?' 📌':''}</span>`;
+}
+
 function _renderConfigPills(idx, m) {
  const configs = m.configs || [];
  if (!configs.length) return '';
  const pills = configs.map((c, cfgIdx) => {
    const label = (c.settings && (c.settings.config_name || c.settings.alias)) || `Config ${cfgIdx + 1}`;
-    return `<span class="badge badge-user" style="font-size:10px;cursor:pointer;vertical-align:middle;margin:.1rem .1rem 0 0" onclick="openCfgModal(${idx},${cfgIdx})" title="Edit this configuration">${esc(label)}</span>`;
+    return `<span class="badge badge-user" style="font-size:10px;cursor:pointer;vertical-align:middle;margin:.1rem .1rem 0 0" onclick="openCfgModal(${idx},${cfgIdx})" title="Edit this configuration">${esc(label)}</span>${_engineTagHtml(m, c.settings)}`;
  }).join('');
  const addPill = `<span class="badge" style="font-size:10px;cursor:pointer;vertical-align:middle;margin:.1rem 0 0 0;background:var(--raised);border:1px dashed var(--border);color:var(--text-2)" onclick="openCfgModalNew(${idx})" title="Add another configuration for this model">+ Config</span>`;
  return `<br style="line-height:.5rem">${pills}${addPill}`;
@@ -2338,6 +2397,9 @@ async function refreshLocal(){
 }

 loadGlobalSettings();
+// Load engine/card info first so the per-model card tags render on the first paint,
+// then re-render once it's available (covers the fetch resolving after the list).
+_loadEngineInfo().then(() => loadCachedModels());
 refreshLocal();

 // Toggle the acceleration / TurboQuant sections as model types are checked/unchecked.
@@ -2731,6 +2793,7 @@ function openCfgModal(idx, cfgIdx){
  document.getElementById('cfg-noram').checked = !!s.no_ram;
  document.getElementById('cfg-offload-strategy').value = s.offload_strategy || 'auto';
  document.getElementById('cfg-offload-dir').value = s.offload_dir || _defaultOffloadDir;
+  _populateEnginePin(s.engine || '');
  document.getElementById('cfg-sysprompt').value = s.system_prompt || '';
  document.getElementById('cfg-parser').value = s.parser || (!m.in_config ? _autoDetectParser(m.path) : 'auto');
  document.getElementById('cfg-tools').checked = !!s.tools_closer_prompt;
@@ -3027,6 +3090,21 @@ async function removeThisConfig(){
  } catch(e) { alert('Error: ' + e.message); }
 }

+// Engine-pin field: populate the <select> from the known engine names and only show
+// the row when more than one engine is configured (single-engine setups don't need it).
+async function _populateEnginePin(desired){
+  const row = document.getElementById('cfg-engine-row');
+  const sel = document.getElementById('cfg-engine');
+  try {
+    if (!_engineNames || !_engineNames.length) await _loadEngineInfo();
+    const want = (desired !== undefined) ? desired : sel.value;
+    sel.querySelectorAll('option:not([value=""])').forEach(o => o.remove());
+    _engineNames.forEach(n => { const o=document.createElement('option'); o.value=n; o.textContent=n; sel.appendChild(o); });
+    sel.value = want || '';   // set AFTER options exist so the selection sticks
+    row.style.display = _engineNames.length > 1 ? '' : 'none';
+  } catch(e) { row.style.display = 'none'; }
+}
+
 async function saveModelConfig(){
  const path = document.getElementById('cfg-path').value;
  const maxGpu = parseFloat(document.getElementById('cfg-max-gpu').value);
@@ -3063,6 +3141,7 @@ async function saveModelConfig(){
    no_ram:            document.getElementById('cfg-noram').checked,
    offload_strategy:  document.getElementById('cfg-offload-strategy').value,
    offload_dir:       document.getElementById('cfg-offload-dir').value.trim() || './offload',
+    engine:            document.getElementById('cfg-engine').value.trim() || null,
    system_prompt:     document.getElementById('cfg-sysprompt').value.trim() || null,
    parser:            document.getElementById('cfg-parser').value,
    tools_closer_prompt: document.getElementById('cfg-tools').checked,
@@ -3094,7 +3173,12 @@ async function saveModelConfig(){
      body: JSON.stringify(data)
    });
    const d = await r.json();
-    if(d.success){ closeModal('cfg-modal'); loadCachedModels(); }
+    if(d.success){
+      if (d.warnings && d.warnings.length) {
+        alert('Saved, but check this:\n\n• ' + d.warnings.join('\n• '));
+      }
+      closeModal('cfg-modal'); loadCachedModels();
+    }
    else alert('Error: '+(d.detail||'Unknown'));
  }catch(e){ alert('Error: '+e.message); }
 }
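
The inference rules in `_engineTagHtml` above (pinned engine first, otherwise a guess from the model path and backend) can be sketched in Python as follows. The function name `infer_engine` is hypothetical, and this mirrors only the display heuristic in the template; whether the server-side router applies the same rules is not shown in this diff.

```python
def infer_engine(path, settings, default_engine=""):
    """Return (engine, pinned) following the same rules as _engineTagHtml."""
    settings = settings or {}
    pinned = (settings.get("engine") or "").strip()
    if pinned:
        return pinned, True
    p = (path or "").lower()
    is_gguf = "gguf" in p  # covers the .gguf suffix and "gguf" anywhere in the path
    is_whisper = settings.get("backend") == "whisper-server"
    is_ds4 = "deepseek-v4" in p
    if is_ds4 or (not is_gguf and not is_whisper):
        return "nvidia", False  # ds4/transformers
    return (default_engine or "any"), False  # gguf/whisper


if __name__ == "__main__":
    print(infer_engine("models/foo.gguf", {}, "radeon"))
    print(infer_engine("org/some-model", {}))
```

Note that a path containing `deepseek-v4` is tagged `nvidia` even when it also contains `gguf`, because the ds4 check takes precedence.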

--- a/codai/admin/templates/settings.html
+++ b/codai/admin/templates/settings.html
--- a/codai/admin/templates/tasks.html
+++ b/codai/admin/templates/tasks.html
@@ -30,11 +30,23 @@
 <div id="sys-stats" style="display:grid;grid-template-columns:repeat(auto-fit,minmax(220px,1fr));
     gap:.75rem;margin:0 0 1.25rem">
  <div class="sys-tile" id="tile-cpu"></div>
-  <div class="sys-tile" id="tile-gpu"></div>
  <div class="sys-tile" id="tile-ram"></div>
+  <!-- Per-card GPU tiles (util + VRAM) injected here when cards are detected. -->
+  <div id="tile-cards" style="display:contents"></div>
+  <!-- Fallback single tiles when per-card stats are unavailable. -->
+  <div class="sys-tile" id="tile-gpu"></div>
  <div class="sys-tile" id="tile-vram"></div>
 </div>

+<!-- Engines (only shown in front/multi-engine mode) -->
+<div id="engines-card" style="display:none;margin:0 0 1.25rem">
+  <div style="display:flex;align-items:baseline;gap:.5rem;margin-bottom:.5rem">
+    <h2 style="font-size:14px;margin:0">Engines</h2>
+    <span class="dim small">restart a stuck engine — the supervisor respawns it</span>
+  </div>
+  <div id="engines-body" style="display:grid;grid-template-columns:repeat(auto-fit,minmax(240px,1fr));gap:.6rem"></div>
+</div>
+
 <style>
 .sys-tile{border:1px solid var(--border,#2a2a2a);border-radius:10px;padding:.7rem .85rem;
  background:var(--card-bg,rgba(255,255,255,.02))}
@@ -76,7 +88,7 @@ function fmtTime(s) {
  } catch { return ''; }
 }

-const KIND_LABEL = {training:'Training', image:'Image', video:'Video', upscale:'Upscale', interpolate:'Interpolate', audio:'Audio', text:'Text', pipeline:'Pipeline', request:'Request', loading:'Loading'};
+const KIND_LABEL = {training:'Training', image:'Image', video:'Video', upscale:'Upscale', interpolate:'Interpolate', audio:'Audio', text:'Text', tts:'Speech (TTS)', transcription:'Transcription', embedding:'Embedding', spatial:'3D / Spatial', pipeline:'Pipeline', request:'Request', loading:'Loading'};
 const STATUS_BADGE = {
  running:'badge-admin', queued:'badge-user', done:'badge-ok', error:'badge-err',
  cancelled:'badge-user', interrupted:'badge-warn'
@@ -140,18 +152,89 @@ function _memTile(name, used, total, pct){
  return `<div class="sys-head"><span class="sys-name">${name}</span><span class="sys-val">${valTxt}</span></div>`
    + _bar(p) + `<div class="sys-sub"><span>${p == null ? '' : Math.round(p)+'% used'}</span><span></span></div>`;
 }
+// One tile per physical card showing both GPU utilization and VRAM (+ temp).
+function _cardTile(c){
+  const vColor = c.vendor==='nvidia' ? '#76b900'
+               : c.vendor==='amd' ? '#ed1c24' : 'var(--text-3)';
+  const memP = (c.mem_total ? (c.mem_used / c.mem_total * 100) : null);
+  const temp = (c.temp!=null) ? ' · '+Math.round(c.temp)+'°C' : '';
+  const util = (c.util!=null) ? Math.round(c.util)+'%' : '—';
+  return `<div class="sys-tile">
+    <div class="sys-head"><span class="sys-name" style="color:${vColor}">${esc(c.name)}</span>
+      <span class="sys-val">${util}${temp}</span></div>
+    ${_bar(c.util)}
+    <div class="sys-sub"><span>VRAM ${c.mem_used!=null?c.mem_used.toFixed(1):'—'}/${c.mem_total!=null?c.mem_total.toFixed(0):'—'} GB</span>
+      <span>${memP!=null?Math.round(memP)+'% used':''}</span></div>
+    ${_bar(memP)}
+  </div>`;
+}
+
 async function loadSystemStats(){
  try {
    const s = await fetch(ROOT_PATH + '/admin/api/system-stats').then(r => r.json());
    const cpu = s.cpu || {}, gpu = s.gpu || {}, ram = s.ram || {}, vram = s.vram || {};
    document.getElementById('tile-cpu').innerHTML = _utilTile('CPU', cpu.util, cpu.temp, (cpu.cores || 1) * 100);
-    document.getElementById('tile-gpu').innerHTML = _utilTile('GPU', gpu.util, gpu.temp);
    document.getElementById('tile-ram').innerHTML = _memTile('RAM', ram.used, ram.total, ram.percent);
-    document.getElementById('tile-vram').innerHTML =
-      _memTile('VRAM', vram.used, vram.total, vram.percent);
+
+    // Per-card GPU+VRAM tiles for every physical card; fall back to single tiles.
+    let cards = [];
+    try { cards = ((await fetch(ROOT_PATH + '/admin/api/gpu-stats').then(r => r.json())).cards) || []; } catch(e){}
+    const cardsEl = document.getElementById('tile-cards');
+    const gpuEl = document.getElementById('tile-gpu');
+    const vramEl = document.getElementById('tile-vram');
+    if (cards.length) {
+      cardsEl.innerHTML = cards.map(_cardTile).join('');
+      gpuEl.style.display = 'none'; vramEl.style.display = 'none';
+    } else {
+      cardsEl.innerHTML = '';
+      gpuEl.style.display = ''; vramEl.style.display = '';
+      gpuEl.innerHTML = _utilTile('GPU', gpu.util, gpu.temp);
+      vramEl.innerHTML = _memTile('VRAM', vram.used, vram.total, vram.percent);
+    }
  } catch(e){ /* keep last render on transient errors */ }
 }

+// Engines panel — only present in front/multi-engine mode (404 in single-process).
+async function loadEngines(){
+  let engines = null;
+  try {
+    const r = await fetch(ROOT_PATH + '/admin/api/engines');
+    if (!r.ok) { document.getElementById('engines-card').style.display = 'none'; return; }
+    engines = (await r.json()).engines || [];
+  } catch(e) { document.getElementById('engines-card').style.display = 'none'; return; }
+  const card = document.getElementById('engines-card');
+  if (!engines.length) { card.style.display = 'none'; return; }
+  card.style.display = '';
+  document.getElementById('engines-body').innerHTML = engines.map(e => {
+    const dot = e.healthy ? '#3fb950' : '#e5534b';
+    const state = e.healthy ? 'healthy' : 'down / starting';
+    const vram = e.vram ? `${typeof e.vram.used === 'number' ? e.vram.used.toFixed(1) : (e.vram.used ?? 0)}/${e.vram.total} GB` : '';
+    const cool = e.cooling ? ` <span class="badge badge-warn" style="font-size:9px">❄ cooling</span>` : '';
+    const prim = e.primary ? ` <span class="badge badge-user" style="font-size:9px">primary</span>` : '';
+    const models = (e.loaded_models||[]).length;
+    return `<div class="sys-tile">
+      <div class="sys-head">
+        <span class="sys-name">${esc(e.name)} <span class="dim" style="text-transform:none">(${esc(e.backend)})</span>${prim}${cool}</span>
+        <span style="width:9px;height:9px;border-radius:50%;background:${dot};display:inline-block" title="${state}"></span>
+      </div>
+      <div class="sys-sub"><span>${esc(state)}${vram?' · '+esc(vram):''}</span><span>${models} model${models!==1?'s':''}</span></div>
+      <div style="margin-top:.5rem;text-align:right">
+        <button class="btn btn-ghost" style="font-size:11px;padding:.15rem .5rem;color:var(--error,#e55)"
+                onclick="restartEngine(${e.id}, '${esc(e.name)}')" title="Kill and respawn this engine">↻ Restart</button>
+      </div>
+    </div>`;
+  }).join('');
+}
+
+async function restartEngine(id, name){
+  if (!confirm(`Restart engine "${name}"? In-flight requests on it will fail; the supervisor respawns it immediately.`)) return;
+  try {
+    const r = await fetch(ROOT_PATH + '/admin/api/engines/' + id + '/restart', {method:'POST'});
+    if (!r.ok) { const e = await r.json().catch(()=>({})); alert(e.detail || 'Restart failed'); }
+    setTimeout(loadEngines, 800);
+  } catch(e) { alert(e.message); }
+}
+
 let _refreshing = false;
 async function loadTasks() {
  if (_refreshing) return;
@@ -165,7 +248,19 @@ async function loadTasks() {

    const therm = data.thermal || {};
    const banner = document.getElementById('thermal-banner');
-    if (therm.active) {
+    // Multi-engine: name which engine(s) are cooling and on what (GPU vs CPU).
+    const cooling = data.cooling_engines || [];
+    if (cooling.length) {
+      const parts = cooling.map(c => {
+        const what = (c.gpu != null && c.cpu == null) ? `GPU ${Math.round(c.gpu)}°C`
+                   : (c.cpu != null && c.gpu == null) ? `CPU ${Math.round(c.cpu)}°C`
+                   : (c.message || 'cooling');
+        return `${esc(c.engine)} (${esc(what)})`;
+      });
+      document.getElementById('thermal-banner-msg').textContent =
+        ' Cooling down: ' + parts.join(', ');
+      banner.style.display = '';
+    } else if (therm.active) {
      document.getElementById('thermal-banner-msg').textContent = ' ' + (therm.message || '');
      banner.style.display = '';
    } else {
@@ -207,7 +302,7 @@ async function loadTasks() {
      }
      return `<tr>
        <td><span class="badge badge-user">${esc(KIND_LABEL[t.kind] || t.kind)}</span></td>
-        <td><div class="td-name">${esc(title)}</div><div class="dim small mono">${esc(t.model || '')}</div></td>
+        <td><div class="td-name">${esc(title)}${t.engine?` <span class="badge badge-user" style="font-size:9px;padding:.05rem .3rem;vertical-align:middle" title="Running on engine">${esc(t.engine)}</span>`:''}</div><div class="dim small mono">${esc(t.model || '')}</div></td>
        <td>${statusCell}</td>
        <td>${progressBar(t)}</td>
        <td class="dim small">${fmtTime(t.started_at)}</td>
@@ -248,7 +343,9 @@ async function removeTask(id) {

 loadTasks();
 loadSystemStats();
+loadEngines();
 setInterval(loadTasks, 2000);
 setInterval(loadSystemStats, 2000);
+setInterval(loadEngines, 5000);
 </script>
 {% endblock %}
--- a/codai/api/app.py
+++ b/codai/api/app.py
@@ -160,6 +160,32 @@ except ImportError:
    pass


+class _InternalAuthMiddleware:
+    """Reject any HTTP request that doesn't carry the front's internal token.
+
+    Active only when CODERAI_INTERNAL_TOKEN is set (i.e. this process is an engine
+    spawned by the front). It binds 127.0.0.1, but this also blocks anything else on
+    localhost from talking to the engine directly and bypassing the front. In
+    single-process mode the token is unset and this is a no-op."""
+
+    def __init__(self, app):
+        self._app = app
+        self._token = os.environ.get("CODERAI_INTERNAL_TOKEN")
+
+    async def __call__(self, scope, receive, send):
+        if self._token and scope.get("type") == "http":
+            headers = dict(scope.get("headers", []))
+            got = headers.get(b"x-coderai-internal", b"").decode("latin-1")
+            if got != self._token:
+                await send({"type": "http.response.start", "status": 403,
+                            "headers": [(b"content-type", b"application/json")]})
+                await send({"type": "http.response.body",
+                            "body": b'{"error":"forbidden: engines are reachable only '
+                                    b'through the front proxy"}'})
+                return
+        await self._app(scope, receive, send)
+
+
 class _ForwardedPrefixMiddleware:
    """Populate ASGI root_path from X-Forwarded-Prefix / X-Script-Name headers."""

@@ -180,6 +206,9 @@ class _ForwardedPrefixMiddleware:


 app.add_middleware(_ForwardedPrefixMiddleware)
+# Added last → outermost: the internal-token gate runs before anything else, so a
+# request without the front's token never reaches a route.
+app.add_middleware(_InternalAuthMiddleware)

 # Mount static files for admin dashboard
 from fastapi.staticfiles import StaticFiles
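
The internal-token gate added in this file can be exercised without a server by driving the ASGI callable directly. The sketch below re-creates the gate in a reduced form and feeds it fake requests; `InternalTokenGate`, `ok_app` and `status_for` are hypothetical names, the token is passed in rather than read from `CODERAI_INTERNAL_TOKEN`, and the comparison uses `hmac.compare_digest` as a swapped-in alternative to the plain `!=` in the diff.

```python
import asyncio
import hmac


class InternalTokenGate:
    """ASGI wrapper that answers 403 unless x-coderai-internal matches the token."""

    def __init__(self, app, token):
        self._app = app
        self._token = token

    async def __call__(self, scope, receive, send):
        if self._token and scope.get("type") == "http":
            headers = dict(scope.get("headers", []))
            got = headers.get(b"x-coderai-internal", b"")
            # Assumption: constant-time compare instead of the plain != in the diff.
            if not hmac.compare_digest(got, self._token.encode("latin-1")):
                await send({"type": "http.response.start", "status": 403,
                            "headers": [(b"content-type", b"application/json")]})
                await send({"type": "http.response.body",
                            "body": b'{"error":"forbidden"}'})
                return
        await self._app(scope, receive, send)


async def ok_app(scope, receive, send):
    """Stand-in for the wrapped application: always answers 200."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


def status_for(token, headers):
    """Drive one fake HTTP request through the gate and return the status code."""
    sent = []

    async def send(message):
        sent.append(message)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    scope = {"type": "http", "headers": headers}
    asyncio.run(InternalTokenGate(ok_app, token)(scope, receive, send))
    return sent[0]["status"]


if __name__ == "__main__":
    print(status_for("s3cret", []))
    print(status_for("s3cret", [(b"x-coderai-internal", b"s3cret")]))
```

With no token configured the gate passes everything through, which matches the single-process no-op behaviour described in the docstring.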
@@ -193,6 +222,77 @@ from fastapi.responses import FileResponse, Response as _FaviconResponse
 _favicon_path = admin_static_dir / "favicon.ico"


+@app.get("/healthz", include_in_schema=False)
+async def healthz():
+    """Cheap liveness probe that touches no torch/model state.
+
+    The front proxy's engine supervisor polls this to distinguish a *slow* engine
+    (busy loading a model — the event loop may be blocked, so this can be late but
+    will eventually answer) from a *dead* one (connection refused). It must stay
+    trivial and dependency-free so it returns the instant the loop is free."""
+    import os as _os
+    return {"ok": True, "pid": _os.getpid()}
+
+
+@app.get("/internal/engine-state", include_in_schema=False)
+async def internal_engine_state():
+    """Auth-free engine introspection for the front proxy's router/aggregator.
+
+    Engines bind 127.0.0.1 only, so this is not publicly reachable. Returns which
+    models are resident (for model→engine routing) and this engine's GPU/VRAM (for
+    cross-engine status aggregation). Kept cheap so it answers even mid-generation.
+    """
+    import os as _os
+    try:
+        loaded = list(multi_model_manager.models.keys())
+    except Exception:
+        loaded = []
+    vram = None
+    try:
+        import torch
+        if torch.cuda.is_available():
+            # Sum across every CUDA device this engine can see — an engine may own
+            # more than one GPU (e.g. two NVIDIA cards sharding one large model), so
+            # reporting only device 0 would under-count its VRAM.
+            n = torch.cuda.device_count()
+            used = free = total = 0
+            devs = []
+            for i in range(n):
+                f, t = torch.cuda.mem_get_info(i)
+                used += (t - f); free += f; total += t
+                devs.append({"index": i, "name": torch.cuda.get_device_name(i),
+                             "free": round(f / 1e9, 2), "total": round(t / 1e9, 2)})
+            label = (torch.cuda.get_device_name(0) if n == 1
+                     else f"{n}× CUDA")
+            vram = {"used": round(used / 1e9, 2), "free": round(free / 1e9, 2),
+                    "total": round(total / 1e9, 2), "gpu": label,
+                    "devices": devs, "device_count": n}
+    except Exception:
+        vram = None
+    # Running tasks so the front can show cross-engine activity without needing a
+    # session on this engine (sessions live only on the primary).
+    tasks = []
+    try:
+        from codai.tasks import task_registry
+        tasks = [t for t in task_registry.list()
+                 if t.get("status") in ("running", "queued", "paused")]
+    except Exception:
+        tasks = []
+    # This engine's thermal cooldown state, so the front can show WHICH engine is
+    # cooling (each engine pauses on its own GPUs; CPU pauses everything).
+    cooling = None
+    try:
+        from codai.models import thermal
+        cs = thermal.get_cooldown_state()
+        if cs.get("active"):
+            cooling = {"gpu": cs.get("gpu"), "cpu": cs.get("cpu"),
+                       "message": cs.get("message")}
+    except Exception:
+        cooling = None
+    return {"ok": True, "pid": _os.getpid(), "loaded_models": loaded,
+            "vram": vram, "tasks": tasks, "cooling": cooling}
+
+
 @app.get("/favicon.ico", include_in_schema=False)
 async def favicon():
    if _favicon_path.exists():
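
The `/healthz` docstring above distinguishes a slow engine (late answer) from a dead one (connection refused). A minimal sketch of a probe that makes the same distinction is shown below, using only the standard library. The function name `probe`, its return labels and the timeout value are assumptions; the supervisor's real polling code is not part of this excerpt. The optional header uses the `x-coderai-internal` name from the middleware in this diff.

```python
import http.client
import socket


def probe(host, port, timeout=2.0, token=None):
    """Classify an engine as "alive", "slow" or "dead" from one GET /healthz."""
    headers = {"x-coderai-internal": token} if token else {}
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/healthz", headers=headers)
        conn.getresponse().read()
        return "alive"  # any HTTP answer means the process is up
    except ConnectionRefusedError:
        return "dead"  # nothing is listening on the port
    except (socket.timeout, TimeoutError):
        return "slow"  # socket accepted, but no answer within the timeout
    except (OSError, http.client.HTTPException):
        return "dead"  # reset, unreachable, or a garbled reply
    finally:
        conn.close()


if __name__ == "__main__":
    # Find a port that is free right now, so the connection is refused.
    s = socket.socket()
    s.bind(("127.0.0.1", 0))
    free_port = s.getsockname()[1]
    s.close()
    print(probe("127.0.0.1", free_port, timeout=1.0))
```

A caller would typically tolerate several consecutive "slow" results before acting, since a model load can block the event loop for a while.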

--- a/codai/api/ds4_worker.py
+++ b/codai/api/ds4_worker.py
--- a/codai/api/embeddings.py
+++ b/codai/api/embeddings.py
@@ -106,6 +106,27 @@ async def create_embeddings(request: EmbeddingsRequest, http_request: Request =
    """
    OpenAI-compatible embeddings endpoint.
    """
+    # Register a task so embeddings appear in the unified task list, like every
+    # other model type. Finished on success or error below.
+    from codai.tasks import task_registry
+    _title = request.input if isinstance(request.input, str) else "embeddings"
+    _tid = task_registry.register(
+        "embedding", title=str(_title)[:80], model=(request.model or "embedding"))
+    task_registry.start(_tid)
+    try:
+        _resp = await _run_embeddings(request, http_request)
+        task_registry.finish(_tid, "done")
+        return _resp
+    except HTTPException:
+        task_registry.finish(_tid, "error")
+        raise
+    except Exception as e:
+        task_registry.finish(_tid, "error", str(e)[:200])
+        raise
+
+
+async def _run_embeddings(request: EmbeddingsRequest, http_request: Request = None):
+    """Core embeddings logic; registered as a task by create_embeddings()."""
    model_info = await asyncio.to_thread(
        multi_model_manager.request_model, request.model, model_type="embedding")
    model_name = model_info.get('model_name')

--- a/codai/api/parler_worker.py
+++ b/codai/api/parler_worker.py
+# CoderAI - OpenAI-compatible API server
+# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+"""Fully-managed Parler-TTS worker.
+
+parler-tts pins an old transformers/tokenizers/huggingface-hub that conflict with
+the coderai server's stack, so it can't share this venv. Instead coderai owns the
+whole lifecycle here: on first use it bootstraps a dedicated venv (installing
+parler-tts), launches ``tools/parler_tts_service.py`` in it as a local HTTP
+service, health-checks it, and hands back the URL. The matching
+``_RemoteParlerBackend.cleanup()`` calls :func:`stop_service`, so the model
+manager's normal eviction tears the process down — no manual setup or config.
+"""
+
+import os
+import socket
+import subprocess
+import sys
+import threading
+import time
+from pathlib import Path
+
+_REPO_ROOT = Path(__file__).resolve().parents[2]
+_SERVICE_SCRIPT = _REPO_ROOT / "tools" / "parler_tts_service.py"
+
+# Dedicated venv for the (incompatible) parler-tts stack. It is fully isolated
+# (no --system-site-packages), so it carries its own torch/numpy alongside
+# parler's pinned transformers; see _bootstrap_venv() for why sharing is unsafe.
+_VENV_DIR = Path(os.environ.get("CODERAI_PARLER_VENV")
+                 or os.path.expanduser("~/.coderai/parler_venv"))
+
+_lock = threading.RLock()
+_services: dict[str, dict] = {}   # model_name -> {"proc","port","url"}
+_bootstrapped = False
+
+
+def _venv_python() -> Path:
+    return _VENV_DIR / ("Scripts" if os.name == "nt" else "bin") / (
+        "python.exe" if os.name == "nt" else "python")
+
+
+def _pip_ok(py: Path) -> bool:
+    try:
+        return subprocess.run([str(py), "-c", "import parler_tts, soundfile"],
+                              capture_output=True).returncode == 0
+    except Exception:
+        return False
+
+
+def _venv_is_system_site() -> bool:
+    """True if the venv was built with --system-site-packages (can't isolate)."""
+    try:
+        return "include-system-site-packages = true" in \
+            (_VENV_DIR / "pyvenv.cfg").read_text().lower()
+    except Exception:
+        return False
+
+
+def _bootstrap_venv() -> Path:
+    """Create a fully-isolated venv and install parler-tts (idempotent).
+
+    Isolation is the whole point: parler-tts pins an old transformers/tokenizers
+    that must NOT be shared with — or shadowed by — the server's stack, so the
+    venv gets its own copy of everything (torch included). Returns its python."""
+    global _bootstrapped
+    py = _venv_python()
+    if _bootstrapped and py.exists():
+        return py
+    # A previously-created shared-site venv leaks the server's transformers in;
+    # rebuild it isolated.
+    if py.exists() and _venv_is_system_site():
+        import shutil
+        print("[parler] rebuilding venv as fully isolated …", flush=True)
+        shutil.rmtree(_VENV_DIR, ignore_errors=True)
+    if not _venv_python().exists():
+        print(f"[parler] creating isolated venv at {_VENV_DIR} …", flush=True)
+        _VENV_DIR.parent.mkdir(parents=True, exist_ok=True)
+        subprocess.run([sys.executable, "-m", "venv", str(_VENV_DIR)], check=True)
+    py = _venv_python()
+    if not _pip_ok(py):
+        print("[parler] installing parler-tts + torch into the isolated venv "
+              "(first run, downloads several GB, this can take a while) …", flush=True)
+        subprocess.run([str(py), "-m", "pip", "install",
+                        "git+https://github.com/huggingface/parler-tts.git",
+                        "soundfile"], check=True)
+        if not _pip_ok(py):
+            raise RuntimeError("parler-tts install did not yield an importable package")
+    _bootstrapped = True
+    return py
+
+
+def _free_port() -> int:
+    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+    s.bind(("127.0.0.1", 0))
+    port = s.getsockname()[1]
+    s.close()
+    return port
+
+
+def _pump_logs(proc: subprocess.Popen, tail):
+    for line in proc.stdout:
+        line = line.rstrip()
+        if line:
+            tail.append(line)
+            print(f"[parler] {line}", flush=True)
+
+
+def _health_ok(url: str) -> bool:
+    import requests
+    try:
+        r = requests.get(url + "/health", timeout=3)
+        return r.ok and bool(r.json().get("ok"))
+    except Exception:
+        return False
+
+
+def ensure_service(model_name: str, ready_timeout: float = 1800.0) -> str:
+    """Start (or reuse) the worker for ``model_name`` and return its base URL.
+
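+    First call bootstraps the venv (a multi-GB install) and loads the model from
+    coderai's cache, so the timeout is generous. The worker runs offline and never
+    downloads the model itself. Raises RuntimeError if the service never comes up."""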
+    with _lock:
+        svc = _services.get(model_name)
+        if svc and svc["proc"].poll() is None and _health_ok(svc["url"]):
+            return svc["url"]
+        if svc and svc["proc"].poll() is not None:
+            _services.pop(model_name, None)   # died — restart below
+
+        py = _bootstrap_venv()
+        port = _free_port()
+        url = f"http://127.0.0.1:{port}"
+        env = dict(os.environ)
+        # The worker must use the model already pulled via coderai's HF download
+        # interface — it never downloads anything itself. Point it at coderai's
+        # cache and force offline mode, so a missing model fails fast instead of
+        # silently fetching.
+        try:
+            from codai.models.cache import get_hf_hub_cache_dir
+            hub = get_hf_hub_cache_dir()
+            env["HF_HUB_CACHE"] = hub
+            env["HUGGINGFACE_HUB_CACHE"] = hub
+        except Exception:
+            pass
+        env["HF_HUB_OFFLINE"] = "1"
+        env["TRANSFORMERS_OFFLINE"] = "1"
+        proc = subprocess.Popen(
+            [str(py), str(_SERVICE_SCRIPT), "--model", model_name,
+             "--host", "127.0.0.1", "--port", str(port)],
+            stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
+            bufsize=1, env=env, cwd=str(_REPO_ROOT),
+        )
+        import collections
+        tail = collections.deque(maxlen=15)
+        threading.Thread(target=_pump_logs, args=(proc, tail), daemon=True).start()
+        _services[model_name] = {"proc": proc, "port": port, "url": url}
+
+    def _tail_msg():
+        joined = " | ".join(list(tail)[-5:]).strip()
+        low = joined.lower()
+        if "offline" in low or ("not" in low and "found" in low):
+            return (f". The model isn't in coderai's cache — download "
+                    f"'{model_name}' from the model interface first. ({joined})")
+        return f". Last output: {joined}" if joined else ""
+
+    # Wait (outside the lock) for the service to load the model and answer.
+    deadline = time.time() + ready_timeout
+    while time.time() < deadline:
+        if proc.poll() is not None:
+            raise RuntimeError(
+                f"Parler worker exited (code {proc.returncode}) before becoming ready"
+                + _tail_msg())
+        if _health_ok(url):
+            print(f"[parler] service ready for {model_name} at {url}", flush=True)
+            return url
+        time.sleep(2)
+    stop_service(model_name)
+    raise RuntimeError(f"Parler worker for {model_name} did not become ready in time"
+                       + _tail_msg())
+
+
+def stop_service(model_name: str) -> None:
+    with _lock:
+        svc = _services.pop(model_name, None)
+    if not svc:
+        return
+    proc = svc["proc"]
+    if proc.poll() is None:
+        try:
+            proc.terminate()
+            proc.wait(timeout=10)
+        except Exception:
+            pass
+    if proc.poll() is None:
+        try:
+            proc.kill()
+        except Exception:
+            pass
+    print(f"[parler] service for {model_name} stopped", flush=True)
+
+
+def stop_all() -> None:
+    for name in list(_services.keys()):
+        stop_service(name)
+
+
+import atexit as _atexit
+_atexit.register(stop_all)
--- a/codai/api/spatial.py
+++ b/codai/api/spatial.py
@@ -45,6 +45,31 @@ global_args = None
 global_file_path = None


+def _spatial_task(title: str):
+    """Decorator: register a spatial/3D endpoint in the unified task list so
+    every model type is visible there. Finishes done/error around the call."""
+    import functools
+
+    def deco(fn):
+        @functools.wraps(fn)
+        async def wrap(*args, **kwargs):
+            from codai.tasks import task_registry
+            tid = task_registry.register("spatial", title=title, model="spatial")
+            task_registry.start(tid)
+            try:
+                result = await fn(*args, **kwargs)
+                task_registry.finish(tid, "done")
+                return result
+            except HTTPException:
+                task_registry.finish(tid, "error")
+                raise
+            except Exception as e:
+                task_registry.finish(tid, "error", str(e)[:200])
+                raise
+        return wrap
+    return deco
+
+
 def set_global_args(args):
    global global_args
    global_args = args
@@ -500,6 +525,7 @@ class ImageTo3DRequest(BaseModel):


 @router.post("/v1/images/to3d", summary="Image to 3D model")
+@_spatial_task("Image → 3D")
 async def image_to_3d(request: ImageTo3DRequest, http_request: Request = None):
    """Convert a 2D image to a 3D representation.

@@ -568,6 +594,7 @@ class ImageFrom3DRequest(BaseModel):


 @router.post("/v1/images/from3d", summary="Render a 3D model to an image")
+@_spatial_task("3D → image")
 async def image_from_3d(request: ImageFrom3DRequest, http_request: Request = None):
    """Render a 3D model (GLB/OBJ) to a 2D PNG image from a specified camera angle."""
    raw = _decode_b64(request.model_data)
@@ -601,6 +628,7 @@ class VideoTo3DRequest(BaseModel):


 @router.post("/v1/video/to3d", summary="Video to 3D model")
+@_spatial_task("Video → 3D")
 async def video_to_3d(request: VideoTo3DRequest, http_request: Request = None):
    """Convert a 2D video to a 3D video frame-by-frame.

@@ -642,6 +670,7 @@ class VideoFrom3DRequest(BaseModel):


 @router.post("/v1/video/from3d", summary="Render a 3D model to a video")
+@_spatial_task("3D → video")
 async def video_from_3d(request: VideoFrom3DRequest, http_request: Request = None):
    """Render a 3D model as a 360° turntable video."""
    raw = _decode_b64(request.model_data)
@@ -675,6 +704,7 @@ class Generate3DRequest(BaseModel):


 @router.post("/v1/3d/generate", summary="Generate a 3D model from a prompt")
+@_spatial_task("Generate 3D")
 async def generate_3d(request: Generate3DRequest, http_request: Request = None):
    """Generate a 3D model (GLB) from a text prompt and/or an image.


--- a/codai/api/text.py
+++ b/codai/api/text.py
--- a/codai/api/transcriptions.py
+++ b/codai/api/transcriptions.py
@@ -135,6 +135,32 @@ async def create_transcription(
    if len(file_content) > _MAX_AUDIO_BYTES:
        raise HTTPException(status_code=413, detail="Audio file too large (max 100 MB)")

+    # Register a task so transcription appears in the unified task list, like
+    # every other model type. Finished on success or error below.
+    from codai.tasks import task_registry
+    _tid = task_registry.register(
+        "transcription",
+        title=(file.filename or "audio")[:80],
+        model=model or "",
+    )
+    task_registry.start(_tid)
+    try:
+        _resp = await _run_transcription(
+            file_content, model, language, prompt, response_format, temperature, file)
+        task_registry.finish(_tid, "done")
+        return _resp
+    except HTTPException:
+        task_registry.finish(_tid, "error")
+        raise
+    except Exception as e:
+        task_registry.finish(_tid, "error", str(e)[:200])
+        raise
+
+
+async def _run_transcription(
+    file_content: bytes, model: str, language, prompt, response_format, temperature, file
+):
+    """Core transcription logic; registered as a task by create_transcription()."""
    # Check if the requested model maps to a configured whisper-server instance first.
    # Try alias round-robin resolution before direct ID lookup.
    whisper_model_id = multi_model_manager.resolve_whisper_alias_model_id(model)

--- a/codai/api/tts.py
+++ b/codai/api/tts.py
@@ -28,6 +28,7 @@ from pydantic import BaseModel, ConfigDict

 # Import from codai modules
 from codai.models.manager import multi_model_manager
+from codai.api import tts_backends


 # Global reference to be set by coderai
@@ -40,6 +41,20 @@ def set_global_args(args):
    global_args = args


+# Substrings that mark a model as a text/classifier/embedding model wrongly routed
+# to TTS (e.g. an emotion classifier exposed under a stray ``tts:`` alias).
+_NON_TTS_HINTS = (
+    "go_emotions", "roberta", "bert", "embedding", "e5-", "minilm",
+    "classifier", "toxic", "reranker", "sentence-transformers",
+)
+
+
+def _family_is_text_model(model_name: str) -> bool:
+    """Heuristic guard: True when the model is clearly not a speech synthesizer."""
+    n = (model_name or "").lower()
+    return any(h in n for h in _NON_TTS_HINTS)
+
+
 # =============================================================================
 # Router and Endpoints
 # =============================================================================
@@ -72,6 +87,16 @@ async def create_speech(request: TTSRequest, http_request: Request = None):
    Supports:
    - Kokoro TTS models (when --tts-model is specified)
    """
+    # Register a task so TTS shows up in the unified task list / dashboard,
+    # like every other model type. Finished on success or error below.
+    from codai.tasks import task_registry, loading_task
+    _tid = task_registry.register(
+        "tts",
+        title=(request.input or "")[:80],
+        model=(request.model or request.voice_profile or "tts"),
+    )
+    task_registry.start(_tid)
+    try:
        # If a voice profile is requested, delegate to voice cloning (F5-TTS)
        if request.voice_profile:
            from codai.api.voice_clone import _load_voice, _f5tts_clone
@@ -96,6 +121,7 @@ async def create_speech(request: TTSRequest, http_request: Request = None):
            except Exception as e:
                raise HTTPException(status_code=500, detail=f"Voice cloning failed: {e}")
            audio_base64 = base64.b64encode(audio_bytes).decode('utf-8')
+            task_registry.finish(_tid, "done")
            return {"audio": audio_base64}

        # Use the manager to resolve the model and manage VRAM
@@ -111,7 +137,7 @@ async def create_speech(request: TTSRequest, http_request: Request = None):

        model_name = model_info['model_name']
        model_key = model_info['model_key']
-    kokoro_model = model_info['model_object']
+        tts_backend = model_info['model_object']

        # If no TTS model configured, return an error
        if not model_name:
@@ -120,35 +146,42 @@ async def create_speech(request: TTSRequest, http_request: Request = None):
                detail="TTS not configured. Use --tts-model to specify a model."
            )

-    # Try to use kokoro if available
-    try:
-        from kokoro import Kokoro
+        # Reject text/classifier models that aren't actually speech synthesizers.
+        if _family_is_text_model(model_name):
+            raise HTTPException(
+                status_code=404,
+                detail=(f"Model '{model_name}' is a text model and cannot be used for "
+                        "tts generation. Use a TTS model (e.g. a kokoro/XTTS/Bark model).")
+            )

-        if kokoro_model is None:
-            print(f"Loading Kokoro TTS model: {model_name}")
+        try:
+            from codai.api import tts_backends

-            # Check if model_name is a URL - download it (with caching)
-            model_path = None
-            if model_name.startswith('http://') or model_name.startswith('https://'):
-                print(f"Loading model from URL: {model_name}")
-                from codai.models.cache import load_model
-                model_path = load_model(model_name)
-                if not model_path:
-                    raise Exception(f"Failed to load model from {model_name}")
-            else:
-                # Use local path or model name
+            if tts_backend is None:
+                print(f"Loading TTS model: {model_name}")
                model_path = model_name
-            
-            # Load the Kokoro model
-            kokoro_model = Kokoro(model_path if model_path else model_name)
-            multi_model_manager.add_model(model_key, kokoro_model)
+                if model_name.startswith(('http://', 'https://')):
+                    from codai.models.cache import load_model
+                    model_path = load_model(model_name) or model_name
+                cfg = multi_model_manager.config.get(model_key) or \
+                    multi_model_manager.config.get(f"tts:{model_name}") or {}
+                with loading_task(model_name, model_type="tts"):
+                    tts_backend = await asyncio.to_thread(
+                        tts_backends.load_backend, model_name, model_path, cfg)
+                multi_model_manager.add_model(model_key, tts_backend)
                multi_model_manager.current_model_key = model_key

-        # Generate speech
-        voice = request.voice or "af_sarah"
+            voice = request.voice or getattr(tts_backend, "default_voice", "")
            speed = request.speed or 1.0
+            lang = getattr(request, "language", None) or "en-us"
+            emotion = getattr(request, "emotion", None) or ""
+            style = getattr(request, "style", None) or ""
+            fmt = request.response_format or "wav"

-        audio_bytes = kokoro_model.generate(request.input, voice=voice, speed=speed)
+            samples, sample_rate = await asyncio.to_thread(
+                tts_backend.synthesize, request.input, voice, speed, lang, emotion, style)
+            audio_bytes, out_fmt = await asyncio.to_thread(
+                tts_backends.encode_audio, samples, sample_rate, fmt)

            try:
                from codai.api.archive import archive_manager
@@ -157,27 +190,29 @@ async def create_speech(request: TTSRequest, http_request: Request = None):
                    "tts", "/v1/audio/speech",
                    model_name,
                    request.input,
-                {"voice": voice, "speed": speed, "response_format": request.response_format},
-                [(audio_bytes, request.response_format or "mp3")],
+                    {"voice": voice, "speed": speed, "response_format": out_fmt},
+                    [(audio_bytes, out_fmt)],
                ))
            except Exception:
                pass

-        # Convert to base64
            audio_base64 = base64.b64encode(audio_bytes).decode('utf-8')
+            task_registry.finish(_tid, "done")
+            return {"audio": audio_base64}

-        return {
-            "audio": audio_base64
-        }
-        
-    except ImportError as e:
-        # kokoro not installed
-        raise HTTPException(
-            status_code=501,
-            detail=f"TTS not available. Install kokoro: pip install kokoro. Error: {str(e)}"
-        )
+        except HTTPException:
+            raise
+        except tts_backends.MissingEngineError as e:
+            # Missing optional engine (e.g. coqui-tts) → actionable 501.
+            raise HTTPException(status_code=501, detail=str(e))
        except Exception as e:
            print(f"TTS error: {e}")
            import traceback
            traceback.print_exc()
            raise HTTPException(status_code=500, detail=f"TTS error: {str(e)}")
+    except HTTPException:
+        task_registry.finish(_tid, "error")
+        raise
+    except Exception as e:
+        task_registry.finish(_tid, "error", str(e)[:200])
+        raise
\ No newline at end of file
--- a/codai/api/tts_backends.py
+++ b/codai/api/tts_backends.py
--- a/codai/backends/cuda.py
+++ b/codai/backends/cuda.py
--- a/codai/backends/ds4.py
+++ b/codai/backends/ds4.py
+# CoderAI - OpenAI-compatible API server
+# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+"""ds4 (DeepSeek V4) proxy backend.
+
+ds4-server already speaks the OpenAI HTTP API, so this backend is a thin proxy: it
+forwards chat/completion requests to the managed ``ds4-server`` subprocess (whose
+lifecycle is owned by :mod:`codai.api.ds4_worker`) and adapts the responses to the
+:class:`~codai.backends.base.ModelBackend` contract the model manager expects.
+
+Tool/think parsing is handled the same way as the other backends — by
+``ModelParserAdapter`` over the returned text — so tools are not forwarded to
+ds4-server; the text-level ``DeepSeekParser`` extracts ``<think>`` and tool calls.
+"""
+
+import asyncio
+import threading
+from typing import AsyncGenerator, Dict, List, Optional
+
+from codai.backends.base import ModelBackend
+
+
+class Ds4Backend(ModelBackend):
+    """Proxy backend that routes generation to a managed ds4-server."""
+
+    def __init__(self, cfg=None):
+        # cfg is a codai.config.Ds4Config. When omitted, resolve the active one.
+        if cfg is None:
+            from codai.config import Ds4Config
+            cfg = Ds4Config()
+        self._cfg = cfg
+        self._model_id = getattr(cfg, "model_id", "deepseek-v4") or "deepseek-v4"
+        self._url: Optional[str] = None
+        self._ctx = int(getattr(cfg, "ctx", 100000) or 100000)
+        self._last_usage: Dict = {}
+
+    # ------------------------------------------------------------------ #
+    # lifecycle
+    # ------------------------------------------------------------------ #
+    def load_model(self, model_name: str, **kwargs) -> None:
+        from codai.api import ds4_worker
+        if model_name:
+            self._model_id = model_name
+        self._url = ds4_worker.ensure_service(self._cfg)
+
+    def get_model_name(self) -> str:
+        return self._model_id
+
+    def get_context_size(self) -> int:
+        return self._ctx
+
+    def get_last_usage(self) -> dict:
+        return dict(self._last_usage)
+
+    def cleanup(self) -> None:
+        from codai.api import ds4_worker
+        ds4_worker.stop_service(getattr(self._cfg, "model_id", self._model_id))
+        self._url = None
+
+    # ------------------------------------------------------------------ #
+    # helpers
+    # ------------------------------------------------------------------ #
+    def _base(self) -> str:
+        if not self._url:
+            raise RuntimeError("ds4 service not started")
+        return self._url
+
+    def _store_usage(self, usage: dict) -> None:
+        if usage:
+            self._last_usage = {
+                "prompt_tokens": usage.get("prompt_tokens", 0),
+                "completion_tokens": usage.get("completion_tokens", 0),
+                "total_tokens": usage.get("total_tokens", 0),
+            }
+
+    def format_messages(self, messages) -> str:
+        # ds4-server applies DeepSeek V4's own chat template server-side; this is only
+        # used by callers that need a flat prompt string.
+        parts = []
+        for m in messages:
+            role = m.get("role") if isinstance(m, dict) else getattr(m, "role", "")
+            content = m.get("content") if isinstance(m, dict) else getattr(m, "content", "")
+            parts.append(f"{role}: {content}")
+        return "\n".join(parts)
+
+    def _chat_payload(self, messages, max_tokens, temperature, top_p, stop, stream):
+        payload = {
+            "model": self._model_id,
+            "messages": messages,
+            "temperature": temperature,
+            "top_p": top_p,
+            "stream": stream,
+        }
+        if max_tokens:
+            payload["max_tokens"] = max_tokens
+        if stop:
+            payload["stop"] = stop
+        return payload
+
+    # ------------------------------------------------------------------ #
+    # chat-level generation (preferred by the manager)
+    # ------------------------------------------------------------------ #
+    def generate_chat(self, messages: List[Dict], max_tokens=None, temperature=0.7,
+                      top_p=1.0, stop=None, tools=None, response_format=None):
+        import requests
+        payload = self._chat_payload(messages, max_tokens, temperature, top_p, stop, False)
+        if response_format and response_format.get("type") == "json_object":
+            payload["response_format"] = {"type": "json_object"}
+        r = requests.post(self._base() + "/v1/chat/completions", json=payload, timeout=3600)
+        r.raise_for_status()
+        data = r.json()
+        self._store_usage(data.get("usage", {}))
+        return data["choices"][0]["message"].get("content") or ""
+
+    async def generate_chat_stream(self, messages: List[Dict], max_tokens=None,
+                                   temperature=0.7, top_p=1.0, stop=None, tools=None,
+                                   response_format=None) -> AsyncGenerator[str, None]:
+        payload = self._chat_payload(messages, max_tokens, temperature, top_p, stop, True)
+        async for chunk in self._stream(self._base() + "/v1/chat/completions", payload,
+                                        delta_key="delta"):
+            yield chunk
+
+    # ------------------------------------------------------------------ #
+    # plain completion (fallback path)
+    # ------------------------------------------------------------------ #
+    def generate(self, prompt: str, max_tokens=None, temperature: float = 0.7,
+                 top_p: float = 1.0, stop=None, repeat_penalty: float = 1.0,
+                 presence_penalty: float = 0.0, frequency_penalty: float = 0.0) -> str:
+        return self.generate_chat([{"role": "user", "content": prompt}],
+                                  max_tokens, temperature, top_p, stop)
+
+    async def generate_stream(self, prompt: str, max_tokens=None, temperature: float = 0.7,
+                              top_p: float = 1.0, stop=None, repeat_penalty: float = 1.0,
+                              presence_penalty: float = 0.0,
+                              frequency_penalty: float = 0.0) -> AsyncGenerator[str, None]:
+        async for chunk in self.generate_chat_stream(
+                [{"role": "user", "content": prompt}], max_tokens, temperature, top_p, stop):
+            yield chunk
+
+    # ------------------------------------------------------------------ #
+    # SSE streaming: iterate the blocking requests stream on a worker thread
+    # and hand chunks to the event loop through an asyncio.Queue.
+    # ------------------------------------------------------------------ #
+    async def _stream(self, url: str, payload: dict, delta_key: str
+                      ) -> AsyncGenerator[str, None]:
+        import json
+        loop = asyncio.get_running_loop()
+        queue: asyncio.Queue = asyncio.Queue()
+        _SENTINEL = object()
+
+        def _worker():
+            import requests
+            try:
+                with requests.post(url, json=payload, stream=True, timeout=3600) as r:
+                    r.raise_for_status()
+                    for raw in r.iter_lines(decode_unicode=True):
+                        if not raw or not raw.startswith("data:"):
+                            continue
+                        data = raw[len("data:"):].strip()
+                        if data == "[DONE]":
+                            break
+                        try:
+                            obj = json.loads(data)
+                        except ValueError:
+                            continue
+                        choice = (obj.get("choices") or [{}])[0]
+                        text = (choice.get(delta_key) or {}).get("content") or ""
+                        if text:
+                            loop.call_soon_threadsafe(queue.put_nowait, text)
+                        if obj.get("usage"):
+                            self._store_usage(obj["usage"])
+                        if choice.get("finish_reason"):
+                            break
+            except Exception as exc:  # surface to the consumer
+                loop.call_soon_threadsafe(queue.put_nowait, exc)
+            finally:
+                loop.call_soon_threadsafe(queue.put_nowait, _SENTINEL)
+
+        threading.Thread(target=_worker, daemon=True).start()
+        while True:
+            item = await queue.get()
+            if item is _SENTINEL:
+                break
+            if isinstance(item, Exception):
+                raise item
+            yield item
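
Review note on `_stream()`: the thread-to-event-loop hand-off can be exercised without ds4-server or `requests`. The sketch below is a minimal illustration of the same technique (blocking iterator on a worker thread, items delivered through an `asyncio.Queue` via `call_soon_threadsafe`). `bridge` and `make_iter` are hypothetical names, not part of this commit.

```python
import asyncio
import threading
from typing import AsyncGenerator, Callable, Iterable


async def bridge(make_iter: Callable[[], Iterable[str]]) -> AsyncGenerator[str, None]:
    """Run a blocking iterator on a thread and yield its items on the event loop."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()
    done = object()  # unique end-of-stream marker

    def worker() -> None:
        try:
            for item in make_iter():
                # asyncio.Queue is not thread-safe, so schedule the put on the loop.
                loop.call_soon_threadsafe(queue.put_nowait, item)
        except Exception as exc:  # hand the failure to the consumer
            loop.call_soon_threadsafe(queue.put_nowait, exc)
        finally:
            loop.call_soon_threadsafe(queue.put_nowait, done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = await queue.get()
        if item is done:
            break
        if isinstance(item, Exception):
            raise item
        yield item


async def _collect() -> list:
    return [chunk async for chunk in bridge(lambda: iter(["he", "llo"]))]


chunks = asyncio.run(_collect())
```

One design point worth keeping in mind: if the consumer stops iterating early, the worker thread keeps reading until the upstream iterator ends, so a long-lived stream may need an explicit cancellation signal.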
--- a/codai/backends/vulkan.py
+++ b/codai/backends/vulkan.py
@@ -621,6 +621,27 @@ class VulkanBackend(ModelBackend):
            else:
                raise ValueError(f"Could not cache model from URL: {model_path}")
        
+        # Fallback: a configured .gguf path that no longer exists (e.g. the file was
+        # downloaded into the GGUF cache rather than the HF-hub snapshot the entry
+        # points at, or a stale snapshot hash). Look for the same filename in the
+        # GGUF cache dir before giving up — the model loads without re-editing the
+        # config entry.
+        if model_path.endswith('.gguf') and not os.path.exists(model_path):
+            try:
+                from codai.models.cache import get_model_cache_dir
+                _base = os.path.basename(model_path)
+                _cache = get_model_cache_dir()
+                _cand = os.path.join(_cache, _base)
+                if not os.path.exists(_cand):
+                    import glob as _glob
+                    _hits = _glob.glob(os.path.join(_cache, "**", _base), recursive=True)
+                    _cand = _hits[0] if _hits else _cand
+                if os.path.exists(_cand):
+                    print(f"  Model path missing; resolved from GGUF cache: {_cand}")
+                    model_path = _cand
+            except Exception:
+                pass
+
        if not os.path.exists(model_path):
            raise FileNotFoundError(f"Model file not found: {model_path}")
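
Review note on the GGUF fallback above: the lookup (same basename, first directly in the cache dir, then anywhere beneath it) can be checked in isolation. The sketch below is an illustration under assumptions: `resolve_in_cache` is a hypothetical helper, the cache directory is passed in rather than read from `get_model_cache_dir()`, and hits are sorted for a deterministic pick, whereas the diff takes the first glob result.

```python
import glob
import os
import tempfile


def resolve_in_cache(model_path: str, cache_dir: str) -> str:
    """Return model_path if it exists, else a same-named file found under cache_dir."""
    if os.path.exists(model_path):
        return model_path
    base = os.path.basename(model_path)
    direct = os.path.join(cache_dir, base)
    if os.path.exists(direct):
        return direct
    # Escape both parts so brackets or stars in names are matched literally.
    pattern = os.path.join(glob.escape(cache_dir), "**", glob.escape(base))
    hits = sorted(glob.glob(pattern, recursive=True))
    # Fall back to the original path so the caller still raises FileNotFoundError.
    return hits[0] if hits else model_path


with tempfile.TemporaryDirectory() as cache:
    nested = os.path.join(cache, "repo", "snapshot")
    os.makedirs(nested)
    real = os.path.join(nested, "model.gguf")
    open(real, "wb").close()
    resolved = resolve_in_cache("/stale/path/model.gguf", cache)
    found_nested = (resolved == real)
    unchanged = resolve_in_cache("/stale/path/other.gguf", cache)
```

If two cached repos ship a file with the same name, a basename match can pick the wrong one, so logging the resolved path (as the diff does) is a sensible safeguard.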
        

--- a/codai/broker/capabilities.py
+++ b/codai/broker/capabilities.py
@@ -49,7 +49,13 @@ def build_hardware_summary() -> Dict[str, Any]:
    total_vram_mb = 0
    available_vram_mb = 0

+    # Only use torch if it's ALREADY loaded (i.e. we're in an engine). Never import
+    # it here — the front is torch-free and must stay that way (importing torch in
+    # the front is heavy and would initialise CUDA in the wrong process).
+    import sys as _sys
    try:
+        if "torch" not in _sys.modules:
+            raise ImportError("torch not loaded (front) — using torch-free path")
        import torch

        if torch.cuda.is_available():
@@ -76,6 +82,23 @@ def build_hardware_summary() -> Dict[str, Any]:
    except Exception:
        pass

+    # Torch-free path (e.g. the front, which imports no torch): enumerate every
+    # physical card via nvidia-smi + sysfs so VRAM is reported for the whole node.
+    if not gpus:
+        try:
+            from codai.frontproxy.gpu_detect import gpu_stats
+            for c in gpu_stats():
+                total_mb = int(round((c.get("mem_total") or 0) * 1024))
+                used_mb = int(round((c.get("mem_used") or 0) * 1024))
+                if total_mb <= 0:
+                    continue
+                gpus.append({"name": c.get("name") or c.get("vendor"),
+                             "total_vram_mb": total_mb})
+                total_vram_mb += total_mb
+                available_vram_mb += max(0, total_mb - used_mb)
+        except Exception:
+            pass
+
    if not gpus:
        for total_path in sorted(glob.glob("/sys/class/drm/card*/device/mem_info_vram_total")):
            used_path = total_path.replace("vram_total", "vram_used")

--- a/codai/broker/dispatcher.py
+++ b/codai/broker/dispatcher.py
@@ -60,8 +60,13 @@ def _is_text_response(content_type: str | None) -> bool:
    )


-async def execute_broker_request(app, envelope):
-    """Validate and execute a broker request envelope."""
+async def execute_broker_request(app, envelope, executor=None):
+    """Validate and execute a broker request envelope.
+
+    ``executor`` is an ``async (method, path, headers, query, body) -> {status_code,
+    headers, body}`` callable. When omitted the request is run in-process against
+    ``app`` via the ASGI bridge (engine / single-process mode). The front passes its
+    own executor that proxies to the right engine over HTTP."""

    logger.debug(
        "broker dispatch → op=%s request_id=%s path=%r method=%r stream=%s",
@@ -136,6 +141,12 @@ async def execute_broker_request(app, envelope):
        headers["content-type"] = envelope.content_type

    started_at = perf_counter()
+    if executor is not None:
+        response = await executor(
+            method=envelope.method, path=envelope.path, headers=headers,
+            query=envelope.query, body=body,
+        )
+    else:
        response = await execute_internal_request(
            app,
            method=envelope.method,
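
Review note: a minimal sketch of a callable that fits the `executor` contract described in the new docstring. Everything here is illustrative: `echo_executor` and `dispatch` are hypothetical names, the parameter types are assumptions, and the in-process branch is replaced by a placeholder response because `execute_internal_request` is not available outside the project.

```python
import asyncio
from typing import Any, Dict, Optional


async def echo_executor(*, method: str, path: str, headers: Dict[str, str],
                        query: Optional[Dict[str, Any]],
                        body: Optional[bytes]) -> Dict[str, Any]:
    """Stand-in for the front's HTTP proxy: reports what it would forward."""
    return {
        "status_code": 200,
        "headers": {"content-type": "text/plain"},
        "body": f"{method} {path}".encode(),
    }


async def dispatch(executor=None) -> Dict[str, Any]:
    """Mirror the branch in the hunk: use the executor when one is supplied."""
    request = dict(method="GET", path="/v1/models", headers={}, query=None, body=None)
    if executor is not None:
        return await executor(**request)
    # Placeholder for the in-process path (execute_internal_request in the diff).
    return {"status_code": 501, "headers": {}, "body": b"no in-process app in this sketch"}


result = asyncio.run(dispatch(echo_executor))
```

Passing the request as keyword arguments matches how the hunk calls `executor(method=..., path=..., headers=..., query=..., body=...)`.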

--- a/codai/cli.py
+++ b/codai/cli.py
@@ -224,6 +224,13 @@ configuration directory (--config DIR, default: OS-specific CoderAI directory).
        action="store_true",
        help="Dump model output: raw output, parsed output, and litellm debug info",
    )
+    parser.add_argument(
+        "--debug-requests",
+        action="store_true",
+        help="Log the full request/response payloads exchanged with API clients "
+             "(opencode, etc.): incoming messages + tools and the outgoing "
+             "content/tool_calls. Use to diagnose agentic tool-call loops.",
+    )
    parser.add_argument(
        "--list-cached-models",
        action="store_true",
@@ -278,4 +285,39 @@ configuration directory (--config DIR, default: OS-specific CoderAI directory).
        help="Ignore any existing pipeline cache and rebuild it from scratch this "
             "run (use after changing a model's quantization/precision config).",
    )
+    # ─── Frontend/engine split ───────────────────────────────────────────────
+    parser.add_argument(
+        "--single-process",
+        action="store_true",
+        help="Run the legacy single-process server (UI/API and all model work in "
+             "one process). Default boots a front proxy + supervised engine "
+             "subprocess(es) so the web UI stays responsive during model work.",
+    )
+    parser.add_argument(
+        "--engine-only",
+        action="store_true",
+        help="Run this process as an engine (binds an internal localhost port, no "
+             "front proxy). Normally launched automatically by the front; not "
+             "intended to be run by hand.",
+    )
+    parser.add_argument(
+        "--internal-port",
+        type=int,
+        default=None,
+        help="Internal port for --engine-only mode (the front assigns one per engine).",
+    )
+    parser.add_argument(
+        "--debug-engine",
+        action="store_true",
+        help="General engine debugging in the front/engine split (engine lifecycle, "
+             "spawn details, health transitions). Does NOT include the internal "
+             "HTTP access log — use --debug-engine-web for that.",
+    )
+    parser.add_argument(
+        "--debug-engine-web",
+        action="store_true",
+        help="Show the internal front↔engine HTTP requests in an engine's access log "
+             "(proxied calls, /internal/engine-state, /healthz, …). Suppressed by "
+             "default since every engine only ever serves internal front traffic.",
+    )
    return parser.parse_args()
--- a/codai/config.py
+++ b/codai/config.py
--- a/codai/frontproxy/__init__.py
+++ b/codai/frontproxy/__init__.py
+# CoderAI - OpenAI-compatible API server
+# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+"""Front proxy package: always-responsive web/API front + supervised engines.
+
+See ``docs/frontend-engine-split.md`` and ``docs/process-isolation-plans.md``.
+"""
+
+from codai.frontproxy.app import run_front, build_app
+
+__all__ = ["run_front", "build_app"]
--- a/codai/frontproxy/app.py
+++ b/codai/frontproxy/app.py
--- a/codai/frontproxy/assignment.py
+++ b/codai/frontproxy/assignment.py
+# CoderAI - OpenAI-compatible API server
+# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+"""Assign each configured model to exactly one engine.
+
+With multiple engines, every engine would otherwise read the shared models.json and
+register *every* model — so a model would appear on several engines at once. The
+front instead computes a single **owner** per model and tells each engine which
+models it owns; the engine then registers only those.
+
+Owner precedence (per model):
+  1. The per-model ``engine`` pin (models.json), if that engine can run the model.
+  2. The configured default engine, if it can run the model.
+  3. Round-robin across the capability-compatible engines (balanced, deterministic),
+     so unpinned models spread out instead of all landing on one engine.
+
+A model whose format no engine can serve is left unassigned (it can't run anyway).
+"""
+
+import json
+
+# models.json categories that hold servable model entries.
+_CATEGORIES = (
+    "text_models", "gguf_models", "vision_models", "image_models",
+    "audio_models", "tts_models", "video_models", "audio_gen_models",
+    "embedding_models", "spatial_models",
+)
+
+
+def _entry_path(entry):
+    """The model's path/id — used for capability detection (e.g. is it a .gguf)."""
+    if isinstance(entry, str):
+        return entry
+    if isinstance(entry, dict):
+        return entry.get("path") or entry.get("id")
+    return None
+
+
+def _route_key(entry):
+    """The identifier clients address this entry by (alias > path > id).
+
+    Keying on the alias lets two *configs* of the same model — with distinct
+    aliases — be assigned to different engines; configs sharing a path with no
+    distinct alias collapse to one owner (they're not separately addressable)."""
+    if isinstance(entry, str):
+        return entry
+    if isinstance(entry, dict):
+        return entry.get("alias") or entry.get("path") or entry.get("id")
+    return None
+
+
+def _required_cap(entry, ds4_cfg):
+    from codai.frontproxy.router import required_capability
+    path = _entry_path(entry) or ""
+    backend = entry.get("backend") if isinstance(entry, dict) else None
+    return required_capability(
+        path, backend=backend,
+        ds4_model_id=getattr(ds4_cfg, "model_id", None) if ds4_cfg else None,
+        ds4_enabled=bool(getattr(ds4_cfg, "enabled", False)) if ds4_cfg else False)
+
+
+def compute_assignment(engines, models_path, default_engine=None, ds4_cfg=None):
+    """Return {engine_name: [model_identifiers]} — each model owned by one engine."""
+    assignment = {e.name: [] for e in engines}
+    if not engines or not models_path:
+        return assignment
+    try:
+        with open(models_path) as f:
+            data = json.load(f)
+    except Exception:
+        return assignment
+
+    default_engine = (default_engine or "").strip().lower()
+    rr = {}   # round-robin cursor per candidate-set signature
+    seen = set()
+
+    for cat in _CATEGORIES:
+        for entry in data.get(cat, []):
+            ident = _route_key(entry)
+            if not ident or ident in seen:
+                continue
+            cap = _required_cap(entry, ds4_cfg)
+            candidates = [e for e in engines if e.can_serve(cap)]
+            if not candidates:
+                continue   # nothing can run it — leave unassigned
+
+            owner = None
+            pin = ((entry.get("engine") if isinstance(entry, dict) else "") or "").strip().lower()
+            if pin:
+                owner = next((e for e in candidates
+                              if e.name.lower() == pin or (e.backend or "").lower() == pin), None)
+            if owner is None and default_engine:
+                owner = next((e for e in candidates
+                              if e.name.lower() == default_engine
+                              or (e.backend or "").lower() == default_engine), None)
+            if owner is None:
+                key = tuple(sorted(e.name for e in candidates))
+                i = rr.get(key, 0)
+                owner = candidates[i % len(candidates)]
+                rr[key] = i + 1
+
+            assignment[owner.name].append(ident)
+            seen.add(ident)
+
+    return assignment
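
Review note: the owner precedence documented in `assignment.py` (pin, then default engine, then round-robin over compatible engines) can be traced with stub data. This is a simplified sketch under stated assumptions: `StubEngine` and `assign` are hypothetical, models are given as `(identifier, required_cap, pin)` tuples instead of being read from models.json, and the lowercasing and duplicate-identifier handling of the real function are omitted.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class StubEngine:
    name: str
    backend: str
    capabilities: Set[str] = field(default_factory=set)

    def can_serve(self, cap: Optional[str]) -> bool:
        return (not cap) or (cap in self.capabilities)


def assign(engines, models, default_engine=""):
    """Return {engine_name: [identifiers]} with one owner per model."""
    assignment = {e.name: [] for e in engines}
    rr = {}  # round-robin cursor per candidate-set signature
    for ident, cap, pin in models:
        candidates = [e for e in engines if e.can_serve(cap)]
        if not candidates:
            continue  # nothing can run it, leave unassigned
        owner = None
        for wanted in (pin, default_engine):  # pin first, then default
            if owner is None and wanted:
                owner = next((e for e in candidates
                              if wanted in (e.name, e.backend)), None)
        if owner is None:
            key = tuple(sorted(e.name for e in candidates))
            i = rr.get(key, 0)
            owner = candidates[i % len(candidates)]
            rr[key] = i + 1
        assignment[owner.name].append(ident)
    return assignment


engines = [
    StubEngine("nv0", "nvidia", {"transformers", "gguf"}),
    StubEngine("amd0", "vulkan", {"gguf"}),
]
models = [
    ("llama.gguf", "gguf", ""),             # unpinned: round-robin slot 0
    ("qwen.gguf", "gguf", ""),              # unpinned: round-robin slot 1
    ("mistral.gguf", "gguf", "vulkan"),     # pinned by backend name
    ("phi-hf", "transformers", "vulkan"),   # pin cannot serve it, falls through
    ("exotic", "ds4", ""),                  # no stub engine offers ds4
]
result = assign(engines, models)
```

Note that a pin naming an engine that cannot serve the model is silently ignored, which matches the "if that engine can run the model" wording in the docstring.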
--- a/codai/frontproxy/engine_supervisor.py
+++ b/codai/frontproxy/engine_supervisor.py
--- a/codai/frontproxy/gpu_detect.py
+++ b/codai/frontproxy/gpu_detect.py
--- a/codai/frontproxy/registry.py
+++ b/codai/frontproxy/registry.py
+# CoderAI - OpenAI-compatible API server
+# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+
+"""Front-side registry of engine subprocesses.
+
+The front never imports torch; it knows about engines only through the small,
+auth-free ``/internal/engine-state`` endpoint each engine exposes on localhost.
+This module holds the shared, thread-safe view the supervisor writes and the
+router/aggregator read.
+"""
+
+import threading
+import time
+from dataclasses import dataclass, field
+from typing import Dict, List, Optional, Set
+
+
+# Default model-format capabilities implied by an engine's backend:
+#   transformers — safetensors/HF models (CUDA only here)
+#   gguf         — llama.cpp models (CUDA or Vulkan)
+#   whisper      — whisper.cpp STT (CUDA or Vulkan)
+#   ds4          — DeepSeek V4 via the native ds4 engine (CUDA-only build)
+# An NVIDIA engine can do all of them; a Vulkan (e.g. Radeon) engine does GGUF and
+# whisper, but not transformers and not ds4.
+_DEFAULT_CAPS = {
+    "nvidia": {"transformers", "gguf", "whisper", "ds4"},
+    "cuda": {"transformers", "gguf", "whisper", "ds4"},
+    "vulkan": {"gguf", "whisper"},
+    "opencl": {"gguf", "whisper"},
+    "auto": {"transformers", "gguf", "whisper", "ds4"},
+}
+
+
+@dataclass
+class Engine:
+    id: int
+    gpu: Optional[int]             # device hint for logs (CUDA/Vulkan index; None = n/a)
+    port: int
+    primary: bool = False          # the engine that owns admin/auth/config traffic
+    name: str = ""                 # human label for logs
+    backend: str = "auto"          # nvidia | vulkan | … (forced for this engine)
+    env: dict = field(default_factory=dict)        # extra env applied at spawn
+    capabilities: Set[str] = field(default_factory=set)  # model formats it can serve
+    assigned_models: Set[str] = field(default_factory=set)  # routable ids it owns
+    url: str = ""
+    healthy: bool = False
+    loaded_models: Set[str] = field(default_factory=set)
+    vram: Optional[dict] = None
+    tasks: list = field(default_factory=list)   # running/queued tasks on this engine
+    cooling: Optional[dict] = None  # thermal cooldown state, or None when not cooling
+    last_ok: float = 0.0           # monotonic time of last successful poll
+    proc: object = None            # subprocess.Popen (set by the supervisor)
+
+    def __post_init__(self):
+        if not self.url:
+            self.url = f"http://127.0.0.1:{self.port}"
+        if not self.name:
+            self.name = f"engine#{self.id}"
+        if not self.capabilities:
+            self.capabilities = set(_DEFAULT_CAPS.get(self.backend, {"transformers", "gguf"}))
+
+    def can_serve(self, required_cap: Optional[str]) -> bool:
+        return (not required_cap) or (required_cap in self.capabilities)
+
+
+class EngineRegistry:
+    def __init__(self):
+        self._engines: Dict[int, Engine] = {}
+        self._lock = threading.RLock()
+
+    def add(self, engine: Engine) -> None:
+        with self._lock:
+            self._engines[engine.id] = engine
+
+    def get(self, engine_id: int) -> Optional[Engine]:
+        with self._lock:
+            return self._engines.get(engine_id)
+
+    def all(self) -> List[Engine]:
+        with self._lock:
+            return list(self._engines.values())
+
+    def healthy(self) -> List[Engine]:
+        with self._lock:
+            return [e for e in self._engines.values() if e.healthy]
+
+    def primary(self) -> Optional[Engine]:
+        """The engine that owns admin/session/config — falls back to first healthy."""
+        with self._lock:
+            prim = next((e for e in self._engines.values() if e.primary), None)
+            if prim and prim.healthy:
+                return prim
+            return next((e for e in self._engines.values() if e.healthy), prim)
+
+    def by_name(self, name: Optional[str]) -> Optional[Engine]:
+        """Resolve an engine by its declared name (or, failing that, its backend).
+
+        Used for the configured default engine and per-model pins. Prefers a healthy
+        match but returns an unhealthy one too, so callers can decide."""
+        if not name:
+            return None
+        name = name.strip().lower()
+        with self._lock:
+            engines = list(self._engines.values())
+        match = None
+        for e in engines:
+            if (e.name or "").lower() == name or (e.backend or "").lower() == name:
+                if e.healthy:
+                    return e
+                match = match or e
+        return match
+
+    def update_state(self, engine_id: int, *, healthy: bool,
+                     loaded_models=None, vram=None, tasks=None,
+                     cooling=False) -> None:
+        with self._lock:
+            e = self._engines.get(engine_id)
+            if not e:
+                return
+            e.healthy = healthy
+            if healthy:
+                e.last_ok = time.monotonic()
+            if loaded_models is not None:
+                e.loaded_models = set(loaded_models)
+            if vram is not None:
+                e.vram = vram
+            if tasks is not None:
+                e.tasks = list(tasks)
+            elif not healthy:
+                e.tasks = []
+            if cooling is not False:        # explicit None clears it
+                e.cooling = cooling
+            elif not healthy:
+                e.cooling = None
+
+    def engine_for_model(self, model_key: str, required_cap: Optional[str] = None) -> Optional[Engine]:
+        """Return a healthy, capability-compatible engine that already has the model
+        resident, if any.
+
+        Matching is forgiving: exact key, short-name, or type-prefixed variants —
+        the same fuzzy spirit the manager uses, but read-only over loaded keys."""
+        if not model_key:
+            return None
+        short = model_key.split("/")[-1]
+        with self._lock:
+            for e in self._engines.values():
+                if not e.healthy or not e.can_serve(required_cap):
+                    continue
+                for k in e.loaded_models:
+                    if k == model_key or k.split("/")[-1] == short \
+                            or k.endswith(model_key) or model_key.endswith(k.split(":")[-1]):
+                        return e
+        return None
+
+    def engine_for_assigned(self, model_key: str) -> Optional[Engine]:
+        """The engine the front ASSIGNED this model to (single owner), or None.
+
+        The assignment is the authoritative routing decision (it already encodes
+        pins, the default engine, and balanced auto-selection); match leniently so a
+        short-name / alias resolves to the owner."""
+        if not model_key:
+            return None
+        short = model_key.split("/")[-1]
+        with self._lock:
+            for e in self._engines.values():
+                if not e.healthy:
+                    continue
+                for k in e.assigned_models:
+                    if (k == model_key or k.split("/")[-1] == short
+                            or k.endswith(model_key) or model_key.endswith(k.split("/")[-1])):
+                        return e
+        return None
+
+    def least_loaded(self, required_cap: Optional[str] = None) -> Optional[Engine]:
+        """Pick a healthy, capability-compatible engine to load a new model on:
+        fewest resident models, then most free VRAM."""
+        with self._lock:
+            cands = [e for e in self._engines.values()
+                     if e.healthy and e.can_serve(required_cap)]
+        if not cands:
+            return None
+
+        def _free(e: Engine) -> float:
+            return (e.vram or {}).get("free", 0.0) if e.vram else 0.0
+
+        cands.sort(key=lambda e: (len(e.loaded_models), -_free(e)))
+        return cands[0]
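
Review note: the `least_loaded` ordering (fewest resident models, then most free VRAM) is easy to misread because of the negated sort key. The sketch below reproduces only that selection rule with a hypothetical `StubEngine`; the capability filter and locking are left out, and the unit of `vram["free"]` is whatever the engine reports (not specified in this diff).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class StubEngine:
    name: str
    healthy: bool = True
    loaded_models: Set[str] = field(default_factory=set)
    vram: Optional[dict] = None


def least_loaded(engines: List[StubEngine]) -> Optional[StubEngine]:
    """Pick the healthy engine with the fewest models, then the most free VRAM."""
    cands = [e for e in engines if e.healthy]
    if not cands:
        return None
    # Negating free VRAM makes "more free" sort earlier within a tie.
    return min(cands, key=lambda e: (len(e.loaded_models),
                                     -(e.vram or {}).get("free", 0.0)))


engines = [
    StubEngine("a", loaded_models={"m1"}, vram={"free": 20.0}),
    StubEngine("b", vram={"free": 4.0}),
    StubEngine("c", vram={"free": 10.0}),
    StubEngine("d", healthy=False, vram={"free": 24.0}),
]
choice = least_loaded(engines)
```

Engine `a` has the most free VRAM among healthy engines but loses because model count is compared first.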
--- a/codai/frontproxy/router.py
+++ b/codai/frontproxy/router.py
--- a/codai/main.py
+++ b/codai/main.py
--- a/codai/models/capabilities.py
+++ b/codai/models/capabilities.py
@@ -21,6 +21,7 @@ from threading import Lock
 from typing import List, Optional
 import json
 import os
+import re
 import time


@@ -179,11 +180,15 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities:
        return caps

    # ── Image: upscaling (checked before general SD rule to catch SD-family upscalers) ──
-    if any(x in n for x in ['real-esrgan', 'esrgan', 'swinir', 'edsr',
-                              'bsrgan', 'hat-', 'dat-',
+    # 'hat-'/'dat-' are short, ambiguous tokens (e.g. 'hat-' appears inside
+    # "chat-"); require a word boundary before them so a text "chat"
+    # model isn't mistaken for the HAT/DAT super-resolution checkpoints.
+    if (any(x in n for x in ['real-esrgan', 'esrgan', 'swinir', 'edsr',
+                              'bsrgan',
                              'x2-upscaler', 'x4-upscaler', 'x2_upscaler', 'x4_upscaler',
                              'latent-upscaler', 'latent_upscaler',
-                              'ldm-super-resolution', 'rcan-', 'sr3-']):
+                              'ldm-super-resolution', 'rcan-', 'sr3-'])
+            or re.search(r'\b[hd]at-', n)):
        caps.image_upscaling = True
        caps.image_to_image = True
        return caps
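
Review note: the behaviour of the new `\b[hd]at-` check can be confirmed on a few names. The model names below are made up for illustration, and lowercasing the input is an assumption about how `n` is prepared upstream.

```python
import re

_HAT_DAT = re.compile(r"\b[hd]at-")


def looks_like_hat_or_dat(name: str) -> bool:
    """True when 'hat-' or 'dat-' starts at a word boundary in the name."""
    return bool(_HAT_DAT.search(name.lower()))


samples = {
    "hat-l_srx4": looks_like_hat_or_dat("hat-l_srx4"),        # starts the name
    "org/dat-x4": looks_like_hat_or_dat("org/dat-x4"),        # follows a slash
    "qwen2-chat-7b": looks_like_hat_or_dat("qwen2-chat-7b"),  # inside "chat-"
    "my_hat-model": looks_like_hat_or_dat("my_hat-model"),    # follows "_"
}
```

The last case is worth knowing about: `_` counts as a word character in Python's `re`, so a name like `my_hat-model` is not matched by this pattern.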

--- a/codai/models/manager.py
+++ b/codai/models/manager.py
--- a/codai/models/parser.py
+++ b/codai/models/parser.py
--- a/codai/models/ram_monitor.py
+++ b/codai/models/ram_monitor.py
--- a/codai/models/thermal.py
+++ b/codai/models/thermal.py
--- a/commands
+++ b/commands
+python tools/video_editor.py --no-browser --host 0.0.0.0 --media-dir tools/coderai_media --session
+tools/gen_township_fighters.py -c township_output/township_config.json
+
--- a/docs/deepseek-ds4.md
+++ b/docs/deepseek-ds4.md
--- a/docs/expressive-tts.md
+++ b/docs/expressive-tts.md
+# Expressive TTS (emotion / delivery)
+
+The video editor shows **Emotion** and **Delivery** dropdowns whenever the
+configured TTS model advertises them (`codai/api/tts_backends.py`:
+`family_emotions` / `family_styles`). Two engines support expressive control.
+
+## Bark — in-stack, no extra deps
+
+Works with the server's current `transformers`. Configure a Bark model as the
+TTS model, e.g. `--tts-model suno/bark` (or `suno/bark-small`).
+
+- **Delivery**: `normal`, `whispering` (`[whispers] …`), `singing` (`♪ … ♪`),
+  `emphasis` (UPPERCASE).
+- **Emotion**: inserts a matching non-verbal cue — `laughter`→`[laughs]`,
+  `sigh`→`[sighs]`, `gasp`→`[gasps]`.
+- **Voice**: a Bark preset like `v2/en_speaker_6`. The editor's Kokoro voice ids
+  don't apply and fall back to the default preset (set `voice_preset` in the
+  model config to change it). Speed isn't controllable in Bark.
+
+## Parler — fully managed by coderai (no setup)
+
+`parler-tts` pins old `transformers`/`tokenizers`/`huggingface-hub` versions that
+**conflict with this server**, so never `pip install` it into the coderai venv.
+coderai handles this for you: just use a Parler model as the TTS model
+(e.g. `parler-tts/parler-tts-mini-multilingual`). The worker is launched lazily —
+only when a request for that model actually arrives — and shut down when the
+model is evicted, exactly like loading/unloading any other model. On first use it:
+
+1. creates a dedicated venv at `~/.coderai/parler_venv`
+   (override with `CODERAI_PARLER_VENV`), built `--system-site-packages` so the
+   base torch/numpy are reused and only the conflicting packages land in it;
+2. `pip install`s parler-tts there;
+3. launches `tools/parler_tts_service.py` in that venv on a local port, pointing
+   `HF_HUB_CACHE` at coderai's own cache and forcing **offline mode**
+   (`HF_HUB_OFFLINE=1`) so it loads strictly the model you **already downloaded
+   via the model interface** — the worker never downloads anything itself;
+4. health-checks it and routes synthesis to it.
+
+The worker is owned by `codai/api/parler_worker.py`; the backend's `cleanup()`
+calls `stop_service()`, so the model manager's normal eviction tears the process
+down. The first request blocks while the venv builds, then it's cached.
+
+If the model isn't in coderai's cache, the worker fails fast with a clear error
+("download '<model>' from the model interface first") instead of fetching it.
+Download the Parler model through the normal HF download UI first.
+
+The editor's **Emotion**/**Delivery** dropdowns drive it: coderai POSTs
+`{text, voice, speed, emotion, style}` to the worker, which maps them into a
+natural-language delivery description (whisper / shout / monotone / expressive +
+emotion + pace). A fixed `description` in the model config overrides the
+auto-built one. An explicit `service_url` in the config bypasses management and
+talks to an externally-run service instead.
+
+> The model must still be in the server's allowed-models registry to be
+> selectable — that's the only configuration; the worker itself needs none.
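
Review note: the Bark delivery and emotion mappings listed in `docs/expressive-tts.md` can be sketched as a small text transform. This is an illustration only: `decorate_bark_text` is a hypothetical helper, not the function in `codai/api/tts_backends.py`, and prepending the non-verbal cue is an assumption, since the doc does not say where the cue is inserted.

```python
_EMOTION_CUES = {"laughter": "[laughs]", "sigh": "[sighs]", "gasp": "[gasps]"}


def decorate_bark_text(text: str, delivery: str = "normal", emotion: str = "") -> str:
    """Apply the documented delivery markup, then an optional emotion cue."""
    if delivery == "whispering":
        text = f"[whispers] {text}"
    elif delivery == "singing":
        text = f"♪ {text} ♪"
    elif delivery == "emphasis":
        text = text.upper()
    cue = _EMOTION_CUES.get(emotion)
    # Assumption: the cue goes in front of the (already decorated) text.
    return f"{cue} {text}" if cue else text
```

Unknown delivery or emotion values fall through unchanged, which keeps the helper safe to call with whatever the dropdowns send.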
--- a/docs/frontend-engine-split.md
+++ b/docs/frontend-engine-split.md
--- a/docs/process-isolation-plans.md
+++ b/docs/process-isolation-plans.md
--- a/docs/reverse-proxy-nginx.md
+++ b/docs/reverse-proxy-nginx.md
--- a/packaging/linux/build_oci_image.sh
+++ b/packaging/linux/build_oci_image.sh
--- a/tools/parler_tts_service.py
+++ b/tools/parler_tts_service.py
--- a/tools/video_editor.py
+++ b/tools/video_editor.py
--- a/video_editor.config.json
+++ b/video_editor.config.json