front/engine split, ds4 + media tooling, gemma-4 native tools; ignore runtime artifacts

- frontproxy: torch-free front proxy + per-vendor engine supervisor with auth,
  localhost binding, model routing; Ctrl-C now force-kills engines (own session +
  PDEATHSIG, SIGKILL of engine process groups, watchdog on hung drain)
- gemma-4 tool calling: prompt via native tools= template, parse call:NAME{...}
  into tool_calls, honour generation_config EOS so it stops instead of looping
- ds4 external worker, parler/expressive TTS backends, video editor tooling
- --debug-requests: full client<->API request/response logging + live snapshots
- stop tracking runtime artifacts (video_editor/sessions/, tools/coderai_media/)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 2fb085f4
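The Ctrl-C behaviour described in the frontproxy bullet (engines in their own session, SIGKILL of the engine process groups) can be sketched roughly as below. This is a minimal POSIX-only illustration, not the frontproxy code: `spawn_engine` and `force_kill` are hypothetical names, and the PDEATHSIG and drain-watchdog parts are left out.

```python
import os
import signal
import subprocess
import sys


def spawn_engine(cmd: list[str]) -> subprocess.Popen:
    """Start a child as the leader of a new session (and process group)."""
    # start_new_session=True calls setsid() in the child, so the terminal's
    # Ctrl-C is not delivered to it directly and its pgid equals its pid.
    return subprocess.Popen(cmd, start_new_session=True)


def force_kill(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """SIGKILL the child's whole process group and reap the child."""
    try:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
    except ProcessLookupError:
        pass  # the group is already gone
    return proc.wait(timeout=timeout)


if __name__ == "__main__":
    engine = spawn_engine([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit status:", force_kill(engine))
```

Killing the group rather than the single pid also takes down any helper processes the engine itself spawned, provided they stayed in the engine's process group.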
......@@ -33,3 +33,7 @@ township_output/
# Packaging build cache + runtime temp (large artifacts)
.packaging-cache/
tmp/
# Video editor sessions + generated media (runtime artifacts)
video_editor/sessions/
tools/coderai_media/
......@@ -35,6 +35,7 @@ BACKEND="${1:-all}"
FLASH=false
CUSTOM_VENV=""
PACKAGE=false
DS4=false
# Parse arguments
i=1
......@@ -50,6 +51,9 @@ for arg in "$@"; do
--package)
PACKAGE=true
;;
--ds4)
DS4=true
;;
esac
i=$((i + 1))
done
......@@ -68,6 +72,7 @@ if [[ "$BACKEND" != "nvidia" && "$BACKEND" != "vulkan" && "$BACKEND" != "vulkan-
echo ""
echo "Options:"
echo " --flash - Install Flash Attention 2 for faster inference (NVIDIA only)"
echo " --ds4 - Clone + build the ds4 (DeepSeek V4) native engine"
exit 1
fi
......@@ -755,6 +760,35 @@ package_app() {
echo -e "${YELLOW}Note: The target machine must still provide compatible system GPU/runtime libraries.${NC}"
}
# Optionally clone + build ds4 (DeepSeek V4 native engine). Opt-in via --ds4.
# coderai can also auto-build this at runtime on first use, but doing it here lets
# the OCI/Docker packaging bundle the prebuilt ds4-server binary.
build_ds4() {
local DS4_DIR="${CODERAI_DS4_DIR:-$HOME/.coderai/ds4}"
echo -e "${YELLOW}Building ds4 (DeepSeek V4 engine) → $DS4_DIR ...${NC}"
if [ ! -e "$DS4_DIR/Makefile" ]; then
mkdir -p "$(dirname "$DS4_DIR")"
git clone --depth 1 https://github.com/antirez/ds4 "$DS4_DIR" || {
echo -e "${YELLOW}Warning: could not clone ds4; skipping.${NC}"; return 0; }
fi
local TARGET="cpu"
if command -v nvcc &> /dev/null || [ -d "/usr/local/cuda" ]; then
TARGET="cuda-generic"
elif [ "$(uname -s)" = "Darwin" ]; then
TARGET="" # bare `make` builds the macOS Metal backend
fi
( cd "$DS4_DIR" && make $TARGET ) || {
echo -e "${YELLOW}Warning: ds4 build failed; it can still be built at runtime.${NC}"; return 0; }
if [ -x "$DS4_DIR/ds4-server" ]; then
echo -e "${GREEN}✓ ds4-server built at $DS4_DIR/ds4-server${NC}"
echo -e "${YELLOW}Note: DeepSeek V4 weights are downloaded on first use (multi-GB).${NC}"
fi
}
if [ "$DS4" = true ]; then
build_ds4
fi
# Create .backend file to track which backend was used
echo "$BACKEND" > .backend
......
......@@ -2372,7 +2372,7 @@ const STUDIO_CAPABILITIES = {
optional:[],
notes:[
'Requires <code>insightface</code> and <code>onnxruntime</code>: <code>pip install insightface onnxruntime</code>.',
'The <b>inswapper_128.onnx</b> model is <b>auto-downloaded</b> from HuggingFace on first use (<a href="/admin/models?tab=search&q=inswapper&pipeline=&gguf=no-gguf" class="cap-find-link">deepinsight/inswapper<span class="cap-find-icon">↗</span></a>).',
'The <b>inswapper_128.onnx</b> model is <b>auto-downloaded</b> from HuggingFace on first use (<a href="' + (window.ROOT_PATH||'') + '/admin/models?tab=search&q=inswapper&pipeline=&gguf=no-gguf" class="cap-find-link">deepinsight/inswapper<span class="cap-find-icon">↗</span></a>).',
'No AI model selection needed — this feature uses its own dedicated backend.',
],
backendPath: ROOT_PATH + '/v1/images/faceswap',
......@@ -2386,7 +2386,7 @@ const STUDIO_CAPABILITIES = {
optional:[],
notes:[
'Requires <code>insightface</code> and <code>onnxruntime</code>: <code>pip install insightface onnxruntime</code>.',
'The <b>inswapper_128.onnx</b> model is <b>auto-downloaded</b> from HuggingFace on first use (<a href="/admin/models?tab=search&q=inswapper&pipeline=&gguf=no-gguf" class="cap-find-link">deepinsight/inswapper<span class="cap-find-icon">↗</span></a>).',
'The <b>inswapper_128.onnx</b> model is <b>auto-downloaded</b> from HuggingFace on first use (<a href="' + (window.ROOT_PATH||'') + '/admin/models?tab=search&q=inswapper&pipeline=&gguf=no-gguf" class="cap-find-link">deepinsight/inswapper<span class="cap-find-icon">↗</span></a>).',
'No AI model selection needed — this feature uses its own dedicated backend.',
],
backendPath: ROOT_PATH + '/v1/images/faceswap',
......@@ -2461,14 +2461,14 @@ function capSearchUrl(cap) {
const s = CAP_TO_HF_SEARCH[cap];
if (!s) return null;
const p = new URLSearchParams({ tab:'search', q: s.q, pipeline: s.pipeline, gguf: s.gguf });
return '/admin/models?' + p.toString();
return (window.ROOT_PATH || '') + '/admin/models?' + p.toString();
}
function capMissingHtml(caps, label) {
if (!caps.length) return '';
const links = caps.map(cap => {
const chip = `<span class="cap-chip dim">${cap.replace(/_/g,' ')}</span>`;
if (_localCapSet.has(cap)) {
const url = `/admin/models?local_cap=${encodeURIComponent(cap)}`;
const url = `${window.ROOT_PATH || ''}/admin/models?local_cap=${encodeURIComponent(cap)}`;
return `<a href="${url}" class="cap-find-link" title="You have a local model with ${cap.replace(/_/g,' ')} — click to configure it">${chip}<span class="cap-find-icon" style="color:#6ecf7e">↑ configure</span></a>`;
}
const url = capSearchUrl(cap);
......
......@@ -577,6 +577,13 @@ window.__DEFAULT_WHISPER_SERVER_PATH__ = {{ default_whisper_server_path|tojson }
</select>
</div>
</div>
<div class="form-row" id="cfg-engine-row" style="margin-top:.75rem;display:none">
<label class="form-label">Engine / card</label>
<select id="cfg-engine" class="form-input">
<option value="">Default (auto — by capability)</option>
</select>
<span class="form-hint" style="font-size:11px">Pin this model to a specific engine/card. Overrides the default engine. Only shown when multiple engines are running.</span>
</div>
<div style="display:grid;grid-template-columns:1fr 1fr;gap:.75rem;margin-top:.75rem">
<div class="form-row" style="margin:0">
<label class="form-label">Used VRAM <span class="muted">(GB)</span></label>
......@@ -1441,8 +1448,7 @@ function handleProgressEvent(evt){
showDownloadError(evt.message);
}else if(evt.type==='cancelled'){
_dlDone=true;
if(_dlEs){_dlEs.close();_dlEs=null;}
showDownloadError('Download cancelled');
showDownloadCancelled();
}
// keepalive: ignore
}
......@@ -1483,7 +1489,7 @@ async function reopenDownload(session_id){
if(s.rate) document.getElementById('dl-speed').textContent=fmtRate(s.rate);
if(s.eta!=null) document.getElementById('dl-eta').textContent=fmtEta(s.eta);
if(s.status==='done'){handleProgressEvent({type:'done'});return;}
if(s.status==='cancelled'){showDownloadError('Download cancelled');return;}
if(s.status==='cancelled'){_dlDone=true;showDownloadCancelled();return;}
if(s.status==='error'){showDownloadError(s.error||'Download failed');return;}
}
}
......@@ -1501,15 +1507,27 @@ async function reopenDownload(session_id){
};
}
function showDownloadCancelled(){
if(_dlEs){_dlEs.close();_dlEs=null}
document.getElementById('dl-form').style.display='block';
document.getElementById('dl-progress').style.display='none';
}
async function stopDownload(session_id){
if(!confirm('Cancel this download?')) return;
try{
await fetch(ROOT_PATH + '/admin/api/download-cancel/'+session_id, {method:'POST'});
const r = await fetch(ROOT_PATH + '/admin/api/download-cancel/'+session_id, {method:'POST'});
if(!r.ok){
let detail = r.status+' '+r.statusText;
try{ const j = await r.json(); if(j&&j.detail) detail = j.detail; }catch{}
alert('Could not cancel download: '+detail);
return;
}
if(_dlSessionId===session_id){
if(_dlEs){_dlEs.close();_dlEs=null;}
_dlDone=true;
showDownloadError('Download cancelled');
showDownloadCancelled();
}
pollDownloads(); // refresh the active-downloads strip immediately
}catch(e){
alert('Could not cancel download: '+e.message);
}
......@@ -1798,12 +1816,53 @@ let _localModels = [];
let _ggufFiles = [];
let _hfModels = [];
// Engine/card hardware info (fetched once); used to tag models with the card they
// run on when more than one engine is configured.
let _engineNames = [];
let _defaultEngine = '';
async function _loadEngineInfo(){
// Live engine names from the front (covers auto-detected engines, not just those
// declared in engine_specs); default_engine still comes from settings.
try {
const er = await fetch(ROOT_PATH + '/admin/api/engines');
if (er.ok) _engineNames = ((await er.json()).engines || []).map(e => e.name);
} catch(e) {}
try {
const d = await (await fetch(ROOT_PATH + '/admin/api/settings')).json();
if (!_engineNames.length) _engineNames = (d.server && d.server.engine_names) || [];
_defaultEngine = (d.server && d.server.default_engine) || '';
} catch(e) {}
}
// Compact card tag for a model config. Pinned engines show as-is (with 📌);
// otherwise the engine is inferred from the model's format (transformers/ds4 →
// nvidia; gguf/whisper → the default engine, or "any"). Hidden when ≤1 engine, so
// it never widens single-card setups.
function _engineTagHtml(m, s){
if(!_engineNames || _engineNames.length < 2) return '';
let eng = ((s && s.engine) || '').trim();
let pinned = !!eng;
if(!eng){
const path = (((m && (m.path || m.id || m.filename)) || '') + '').toLowerCase();
const isGguf = path.endsWith('.gguf') || path.includes('gguf');
const isWhisper = ((s && s.backend) || '') === 'whisper-server';
const isDs4 = path.includes('deepseek-v4');
if(isDs4 || (!isGguf && !isWhisper)) eng = 'nvidia'; // ds4/transformers → nvidia
else eng = _defaultEngine || 'any'; // gguf/whisper → default
}
const lc = eng.toLowerCase();
const color = (lc.includes('nv')) ? '#76b900'
: (lc.includes('rad') || lc.includes('amd')) ? '#ed1c24'
: 'var(--text-3)';
const title = pinned ? ('Pinned to engine: ' + eng) : ('Runs on: ' + eng + ' (auto)');
return `<span class="badge" title="${esc(title)}" style="font-size:9px;padding:.05rem .3rem;margin:.1rem .1rem 0 0;vertical-align:middle;border:1px solid ${color};color:${color};background:transparent">${esc(eng)}${pinned?' 📌':''}</span>`;
}
function _renderConfigPills(idx, m) {
const configs = m.configs || [];
if (!configs.length) return '';
const pills = configs.map((c, cfgIdx) => {
const label = (c.settings && (c.settings.config_name || c.settings.alias)) || `Config ${cfgIdx + 1}`;
return `<span class="badge badge-user" style="font-size:10px;cursor:pointer;vertical-align:middle;margin:.1rem .1rem 0 0" onclick="openCfgModal(${idx},${cfgIdx})" title="Edit this configuration">${esc(label)}</span>`;
return `<span class="badge badge-user" style="font-size:10px;cursor:pointer;vertical-align:middle;margin:.1rem .1rem 0 0" onclick="openCfgModal(${idx},${cfgIdx})" title="Edit this configuration">${esc(label)}</span>${_engineTagHtml(m, c.settings)}`;
}).join('');
const addPill = `<span class="badge" style="font-size:10px;cursor:pointer;vertical-align:middle;margin:.1rem 0 0 0;background:var(--raised);border:1px dashed var(--border);color:var(--text-2)" onclick="openCfgModalNew(${idx})" title="Add another configuration for this model">+ Config</span>`;
return `<br style="line-height:.5rem">${pills}${addPill}`;
......@@ -2338,6 +2397,9 @@ async function refreshLocal(){
}
loadGlobalSettings();
// Load engine/card info first so the per-model card tags render on the first paint,
// then re-render once it's available (covers the fetch resolving after the list).
_loadEngineInfo().then(() => loadCachedModels());
refreshLocal();
// Toggle the acceleration / TurboQuant sections as model types are checked/unchecked.
......@@ -2731,6 +2793,7 @@ function openCfgModal(idx, cfgIdx){
document.getElementById('cfg-noram').checked = !!s.no_ram;
document.getElementById('cfg-offload-strategy').value = s.offload_strategy || 'auto';
document.getElementById('cfg-offload-dir').value = s.offload_dir || _defaultOffloadDir;
_populateEnginePin(s.engine || '');
document.getElementById('cfg-sysprompt').value = s.system_prompt || '';
document.getElementById('cfg-parser').value = s.parser || (!m.in_config ? _autoDetectParser(m.path) : 'auto');
document.getElementById('cfg-tools').checked = !!s.tools_closer_prompt;
......@@ -3027,6 +3090,21 @@ async function removeThisConfig(){
} catch(e) { alert('Error: ' + e.message); }
}
// Engine-pin field: populate the <select> from the known engine names and only show
// the row when more than one engine is configured (single-engine setups don't need it).
async function _populateEnginePin(desired){
const row = document.getElementById('cfg-engine-row');
const sel = document.getElementById('cfg-engine');
try {
if (!_engineNames || !_engineNames.length) await _loadEngineInfo();
const want = (desired !== undefined) ? desired : sel.value;
sel.querySelectorAll('option:not([value=""])').forEach(o => o.remove());
_engineNames.forEach(n => { const o=document.createElement('option'); o.value=n; o.textContent=n; sel.appendChild(o); });
sel.value = want || ''; // set AFTER options exist so the selection sticks
row.style.display = _engineNames.length > 1 ? '' : 'none';
} catch(e) { row.style.display = 'none'; }
}
async function saveModelConfig(){
const path = document.getElementById('cfg-path').value;
const maxGpu = parseFloat(document.getElementById('cfg-max-gpu').value);
......@@ -3063,6 +3141,7 @@ async function saveModelConfig(){
no_ram: document.getElementById('cfg-noram').checked,
offload_strategy: document.getElementById('cfg-offload-strategy').value,
offload_dir: document.getElementById('cfg-offload-dir').value.trim() || './offload',
engine: document.getElementById('cfg-engine').value.trim() || null,
system_prompt: document.getElementById('cfg-sysprompt').value.trim() || null,
parser: document.getElementById('cfg-parser').value,
tools_closer_prompt: document.getElementById('cfg-tools').checked,
......@@ -3094,7 +3173,12 @@ async function saveModelConfig(){
body: JSON.stringify(data)
});
const d = await r.json();
if(d.success){ closeModal('cfg-modal'); loadCachedModels(); }
if(d.success){
if (d.warnings && d.warnings.length) {
alert('Saved, but check this:\n\n• ' + d.warnings.join('\n• '));
}
closeModal('cfg-modal'); loadCachedModels();
}
else alert('Error: '+(d.detail||'Unknown'));
}catch(e){ alert('Error: '+e.message); }
}
......
......@@ -30,11 +30,23 @@
<div id="sys-stats" style="display:grid;grid-template-columns:repeat(auto-fit,minmax(220px,1fr));
gap:.75rem;margin:0 0 1.25rem">
<div class="sys-tile" id="tile-cpu"></div>
<div class="sys-tile" id="tile-gpu"></div>
<div class="sys-tile" id="tile-ram"></div>
<!-- Per-card GPU tiles (util + VRAM) injected here when cards are detected. -->
<div id="tile-cards" style="display:contents"></div>
<!-- Fallback single tiles when per-card stats are unavailable. -->
<div class="sys-tile" id="tile-gpu"></div>
<div class="sys-tile" id="tile-vram"></div>
</div>
<!-- Engines (only shown in front/multi-engine mode) -->
<div id="engines-card" style="display:none;margin:0 0 1.25rem">
<div style="display:flex;align-items:baseline;gap:.5rem;margin-bottom:.5rem">
<h2 style="font-size:14px;margin:0">Engines</h2>
<span class="dim small">restart a stuck engine — the supervisor respawns it</span>
</div>
<div id="engines-body" style="display:grid;grid-template-columns:repeat(auto-fit,minmax(240px,1fr));gap:.6rem"></div>
</div>
<style>
.sys-tile{border:1px solid var(--border,#2a2a2a);border-radius:10px;padding:.7rem .85rem;
background:var(--card-bg,rgba(255,255,255,.02))}
......@@ -76,7 +88,7 @@ function fmtTime(s) {
} catch { return ''; }
}
const KIND_LABEL = {training:'Training', image:'Image', video:'Video', upscale:'Upscale', interpolate:'Interpolate', audio:'Audio', text:'Text', pipeline:'Pipeline', request:'Request', loading:'Loading'};
const KIND_LABEL = {training:'Training', image:'Image', video:'Video', upscale:'Upscale', interpolate:'Interpolate', audio:'Audio', text:'Text', tts:'Speech (TTS)', transcription:'Transcription', embedding:'Embedding', spatial:'3D / Spatial', pipeline:'Pipeline', request:'Request', loading:'Loading'};
const STATUS_BADGE = {
running:'badge-admin', queued:'badge-user', done:'badge-ok', error:'badge-err',
cancelled:'badge-user', interrupted:'badge-warn'
......@@ -140,18 +152,89 @@ function _memTile(name, used, total, pct){
return `<div class="sys-head"><span class="sys-name">${name}</span><span class="sys-val">${valTxt}</span></div>`
+ _bar(p) + `<div class="sys-sub"><span>${p == null ? '' : Math.round(p)+'% used'}</span><span></span></div>`;
}
// One tile per physical card showing both GPU utilization and VRAM (+ temp).
function _cardTile(c){
const vColor = c.vendor==='nvidia' ? '#76b900'
: c.vendor==='amd' ? '#ed1c24' : 'var(--text-3)';
const memP = (c.mem_total ? (c.mem_used / c.mem_total * 100) : null);
const temp = (c.temp!=null) ? ' · '+Math.round(c.temp)+'°C' : '';
const util = (c.util!=null) ? Math.round(c.util)+'%' : '—';
return `<div class="sys-tile">
<div class="sys-head"><span class="sys-name" style="color:${vColor}">${esc(c.name)}</span>
<span class="sys-val">${util}${temp}</span></div>
${_bar(c.util)}
<div class="sys-sub"><span>VRAM ${c.mem_used!=null?c.mem_used.toFixed(1):'—'}/${c.mem_total!=null?c.mem_total.toFixed(0):'—'} GB</span>
<span>${memP!=null?Math.round(memP)+'% used':''}</span></div>
${_bar(memP)}
</div>`;
}
async function loadSystemStats(){
try {
const s = await fetch(ROOT_PATH + '/admin/api/system-stats').then(r => r.json());
const cpu = s.cpu || {}, gpu = s.gpu || {}, ram = s.ram || {}, vram = s.vram || {};
document.getElementById('tile-cpu').innerHTML = _utilTile('CPU', cpu.util, cpu.temp, (cpu.cores || 1) * 100);
document.getElementById('tile-gpu').innerHTML = _utilTile('GPU', gpu.util, gpu.temp);
document.getElementById('tile-ram').innerHTML = _memTile('RAM', ram.used, ram.total, ram.percent);
document.getElementById('tile-vram').innerHTML =
_memTile('VRAM', vram.used, vram.total, vram.percent);
// Per-card GPU+VRAM tiles for every physical card; fall back to single tiles.
let cards = [];
try { cards = ((await fetch(ROOT_PATH + '/admin/api/gpu-stats').then(r => r.json())).cards) || []; } catch(e){}
const cardsEl = document.getElementById('tile-cards');
const gpuEl = document.getElementById('tile-gpu');
const vramEl = document.getElementById('tile-vram');
if (cards.length) {
cardsEl.innerHTML = cards.map(_cardTile).join('');
gpuEl.style.display = 'none'; vramEl.style.display = 'none';
} else {
cardsEl.innerHTML = '';
gpuEl.style.display = ''; vramEl.style.display = '';
gpuEl.innerHTML = _utilTile('GPU', gpu.util, gpu.temp);
vramEl.innerHTML = _memTile('VRAM', vram.used, vram.total, vram.percent);
}
} catch(e){ /* keep last render on transient errors */ }
}
// Engines panel — only present in front/multi-engine mode (404 in single-process).
async function loadEngines(){
let engines = null;
try {
const r = await fetch(ROOT_PATH + '/admin/api/engines');
if (!r.ok) { document.getElementById('engines-card').style.display = 'none'; return; }
engines = (await r.json()).engines || [];
} catch(e) { document.getElementById('engines-card').style.display = 'none'; return; }
const card = document.getElementById('engines-card');
if (!engines.length) { card.style.display = 'none'; return; }
card.style.display = '';
document.getElementById('engines-body').innerHTML = engines.map(e => {
const dot = e.healthy ? '#3fb950' : '#e5534b';
const state = e.healthy ? 'healthy' : 'down / starting';
const vram = e.vram ? `${Number(e.vram.used ?? 0).toFixed(1)}/${e.vram.total} GB` : ''; // null-safe: a missing `used` renders as 0.0
const cool = e.cooling ? ` <span class="badge badge-warn" style="font-size:9px">❄ cooling</span>` : '';
const prim = e.primary ? ` <span class="badge badge-user" style="font-size:9px">primary</span>` : '';
const models = (e.loaded_models||[]).length;
return `<div class="sys-tile">
<div class="sys-head">
<span class="sys-name">${esc(e.name)} <span class="dim" style="text-transform:none">(${esc(e.backend)})</span>${prim}${cool}</span>
<span style="width:9px;height:9px;border-radius:50%;background:${dot};display:inline-block" title="${state}"></span>
</div>
<div class="sys-sub"><span>${esc(state)}${vram?' · '+esc(vram):''}</span><span>${models} model${models!==1?'s':''}</span></div>
<div style="margin-top:.5rem;text-align:right">
<button class="btn btn-ghost" style="font-size:11px;padding:.15rem .5rem;color:var(--error,#e55)"
onclick="restartEngine(${e.id}, '${esc(e.name)}')" title="Kill and respawn this engine">↻ Restart</button>
</div>
</div>`;
}).join('');
}
async function restartEngine(id, name){
if (!confirm(`Restart engine "${name}"? In-flight requests on it will fail; the supervisor respawns it immediately.`)) return;
try {
const r = await fetch(ROOT_PATH + '/admin/api/engines/' + id + '/restart', {method:'POST'});
if (!r.ok) { const e = await r.json().catch(()=>({})); alert(e.detail || 'Restart failed'); }
setTimeout(loadEngines, 800);
} catch(e) { alert(e.message); }
}
let _refreshing = false;
async function loadTasks() {
if (_refreshing) return;
......@@ -165,7 +248,19 @@ async function loadTasks() {
const therm = data.thermal || {};
const banner = document.getElementById('thermal-banner');
if (therm.active) {
// Multi-engine: name which engine(s) are cooling and on what (GPU vs CPU).
const cooling = data.cooling_engines || [];
if (cooling.length) {
const parts = cooling.map(c => {
const what = (c.gpu != null && c.cpu == null) ? `GPU ${Math.round(c.gpu)}°C`
: (c.cpu != null && c.gpu == null) ? `CPU ${Math.round(c.cpu)}°C`
: (c.message || 'cooling');
return `${esc(c.engine)} (${esc(what)})`;
});
document.getElementById('thermal-banner-msg').textContent =
' Cooling down: ' + parts.join(', ');
banner.style.display = '';
} else if (therm.active) {
document.getElementById('thermal-banner-msg').textContent = ' ' + (therm.message || '');
banner.style.display = '';
} else {
......@@ -207,7 +302,7 @@ async function loadTasks() {
}
return `<tr>
<td><span class="badge badge-user">${esc(KIND_LABEL[t.kind] || t.kind)}</span></td>
<td><div class="td-name">${esc(title)}</div><div class="dim small mono">${esc(t.model || '')}</div></td>
<td><div class="td-name">${esc(title)}${t.engine?` <span class="badge badge-user" style="font-size:9px;padding:.05rem .3rem;vertical-align:middle" title="Running on engine">${esc(t.engine)}</span>`:''}</div><div class="dim small mono">${esc(t.model || '')}</div></td>
<td>${statusCell}</td>
<td>${progressBar(t)}</td>
<td class="dim small">${fmtTime(t.started_at)}</td>
......@@ -248,7 +343,9 @@ async function removeTask(id) {
loadTasks();
loadSystemStats();
loadEngines();
setInterval(loadTasks, 2000);
setInterval(loadSystemStats, 2000);
setInterval(loadEngines, 5000);
</script>
{% endblock %}
......@@ -160,6 +160,32 @@ except ImportError:
pass
class _InternalAuthMiddleware:
    """Reject any HTTP request that doesn't carry the front's internal token.
    Active only when CODERAI_INTERNAL_TOKEN is set (i.e. this process is an engine
    spawned by the front). The engine already binds 127.0.0.1 only; the token check
    additionally stops other local processes from talking to the engine directly and
    bypassing the front. In single-process mode the token is unset and this is a
    no-op."""
    def __init__(self, app):
        self._app = app
        self._token = os.environ.get("CODERAI_INTERNAL_TOKEN")

    async def __call__(self, scope, receive, send):
        if self._token and scope.get("type") == "http":
            headers = dict(scope.get("headers", []))
            got = headers.get(b"x-coderai-internal", b"").decode("latin-1")
            if got != self._token:
                await send({"type": "http.response.start", "status": 403,
                            "headers": [(b"content-type", b"application/json")]})
                await send({"type": "http.response.body",
                            "body": b'{"error":"forbidden: engines are reachable only '
                                    b'through the front proxy"}'})
                return
        await self._app(scope, receive, send)
class _ForwardedPrefixMiddleware:
"""Populate ASGI root_path from X-Forwarded-Prefix / X-Script-Name headers."""
......@@ -180,6 +206,9 @@ class _ForwardedPrefixMiddleware:
app.add_middleware(_ForwardedPrefixMiddleware)
# Added last → outermost: the internal-token gate runs before anything else, so a
# request without the front's token never reaches a route.
app.add_middleware(_InternalAuthMiddleware)
# Mount static files for admin dashboard
from fastapi.staticfiles import StaticFiles
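The internal-token gate added above can be exercised without a server by driving an ASGI callable directly. The sketch below is a stand-alone approximation, not the project's middleware: `InternalAuth`, `ok_app` and `status_for` are hypothetical names, the token is passed in instead of read from `CODERAI_INTERNAL_TOKEN`, and `hmac.compare_digest` is swapped in for the plain `!=` comparison as one possible hardening.

```python
import asyncio
import hmac


class InternalAuth:
    """Pass HTTP requests through only if they carry the expected token."""

    def __init__(self, app, token):
        self._app = app
        self._token = token  # None/empty disables the gate (single-process mode)

    async def __call__(self, scope, receive, send):
        if self._token and scope.get("type") == "http":
            headers = dict(scope.get("headers", []))
            got = headers.get(b"x-coderai-internal", b"")
            # Constant-time comparison (an assumption; the diff uses `!=`).
            if not hmac.compare_digest(got, self._token.encode("latin-1")):
                await send({"type": "http.response.start", "status": 403,
                            "headers": [(b"content-type", b"application/json")]})
                await send({"type": "http.response.body",
                            "body": b'{"error":"forbidden"}'})
                return
        await self._app(scope, receive, send)


async def ok_app(scope, receive, send):
    """Trivial downstream app that always answers 200."""
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


def status_for(headers, token="s3cret"):
    """Run one fake request through the gate and return the response status."""
    sent = []

    async def send(message):
        sent.append(message)

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    scope = {"type": "http", "headers": headers}
    asyncio.run(InternalAuth(ok_app, token)(scope, receive, send))
    return sent[0]["status"]


if __name__ == "__main__":
    print(status_for([]))                                     # no token header
    print(status_for([(b"x-coderai-internal", b"s3cret")]))   # correct token
```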
......@@ -193,6 +222,77 @@ from fastapi.responses import FileResponse, Response as _FaviconResponse
_favicon_path = admin_static_dir / "favicon.ico"
@app.get("/healthz", include_in_schema=False)
async def healthz():
    """Cheap liveness probe that touches no torch/model state.
    The front proxy's engine supervisor polls this to distinguish a *slow* engine
    (busy loading a model — the event loop may be blocked, so this can be late but
    will eventually answer) from a *dead* one (connection refused). It must stay
    trivial and dependency-free so it returns the instant the loop is free."""
    import os as _os
    return {"ok": True, "pid": _os.getpid()}
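The slow-versus-dead distinction in the `/healthz` docstring could be made on the polling side roughly like this. It is a sketch under assumptions, not the supervisor's actual code: `probe` and its return labels are hypothetical, and it uses only the standard library.

```python
import socket
import urllib.error
import urllib.request

# Local probes should never go through an HTTP proxy from the environment.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str, timeout: float = 2.0) -> str:
    """Classify an engine as 'healthy', 'unhealthy', 'slow' or 'dead'."""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return "healthy" if resp.status == 200 else "unhealthy"
    except urllib.error.HTTPError:
        return "unhealthy"  # the engine answered, but with an error status
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, ConnectionRefusedError):
            return "dead"   # nothing is listening on the port
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "slow"   # connect timed out
        return "dead"
    except (socket.timeout, TimeoutError):
        return "slow"       # connected, but no answer within the timeout


if __name__ == "__main__":
    print(probe("http://127.0.0.1:9/healthz", timeout=1.0))
```

A supervisor built on this would typically tolerate several consecutive "slow" results before acting, while treating "dead" as grounds for an immediate respawn.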
@app.get("/internal/engine-state", include_in_schema=False)
async def internal_engine_state():
    """Auth-free engine introspection for the front proxy's router/aggregator.
    Engines bind 127.0.0.1 only, so this is not publicly reachable. Returns which
    models are resident (for model→engine routing) and this engine's GPU/VRAM (for
    cross-engine status aggregation). Kept cheap so it answers even mid-generation.
    """
    import os as _os
    try:
        loaded = list(multi_model_manager.models.keys())
    except Exception:
        loaded = []
    vram = None
    try:
        import torch
        if torch.cuda.is_available():
            # Sum across every CUDA device this engine can see — an engine may own
            # more than one GPU (e.g. two NVIDIA cards sharding one large model), so
            # reporting only device 0 would under-count its VRAM.
            n = torch.cuda.device_count()
            used = free = total = 0
            devs = []
            for i in range(n):
                f, t = torch.cuda.mem_get_info(i)
                used += (t - f); free += f; total += t
                devs.append({"index": i, "name": torch.cuda.get_device_name(i),
                             "free": round(f / 1e9, 2), "total": round(t / 1e9, 2)})
            label = (torch.cuda.get_device_name(0) if n == 1
                     else f"{n}× CUDA")
            vram = {"used": round(used / 1e9, 2), "free": round(free / 1e9, 2),
                    "total": round(total / 1e9, 2), "gpu": label,
                    "devices": devs, "device_count": n}
    except Exception:
        vram = None
    # Running tasks so the front can show cross-engine activity without needing a
    # session on this engine (sessions live only on the primary).
    tasks = []
    try:
        from codai.tasks import task_registry
        tasks = [t for t in task_registry.list()
                 if t.get("status") in ("running", "queued", "paused")]
    except Exception:
        tasks = []
    # This engine's thermal cooldown state, so the front can show WHICH engine is
    # cooling (each engine pauses on its own GPUs; CPU pauses everything).
    cooling = None
    try:
        from codai.models import thermal
        cs = thermal.get_cooldown_state()
        if cs.get("active"):
            cooling = {"gpu": cs.get("gpu"), "cpu": cs.get("cpu"),
                       "message": cs.get("message")}
    except Exception:
        cooling = None
    return {"ok": True, "pid": _os.getpid(), "loaded_models": loaded,
            "vram": vram, "tasks": tasks, "cooling": cooling}
@app.get("/favicon.ico", include_in_schema=False)
async def favicon():
if _favicon_path.exists():
......
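The `/internal/engine-state` payload above exposes `loaded_models` for model→engine routing. One way a front could use it is sketched below; `pick_engine` is a hypothetical helper and the routing policy (prefer an engine that already has the model resident, else fall back to a default) is an assumption, not something this diff shows.

```python
from typing import Mapping, Optional


def pick_engine(model: str,
                states: Mapping[str, Optional[dict]],
                default: Optional[str] = None) -> Optional[str]:
    """Return the first engine whose state lists `model` as resident.

    `states` maps engine name -> the JSON returned by that engine's
    /internal/engine-state, or None if the engine did not answer.
    """
    for name, state in states.items():
        if state and state.get("ok") and model in state.get("loaded_models", []):
            return name
    return default


if __name__ == "__main__":
    states = {
        "nvidia": {"ok": True, "loaded_models": ["gemma-4"]},
        "radeon": {"ok": True, "loaded_models": ["whisper"]},
        "spare": None,  # engine that did not respond
    }
    print(pick_engine("whisper", states, default="nvidia"))
    print(pick_engine("unknown-model", states, default="nvidia"))
```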
......@@ -106,6 +106,27 @@ async def create_embeddings(request: EmbeddingsRequest, http_request: Request =
"""
OpenAI-compatible embeddings endpoint.
"""
    # Register a task so embeddings appear in the unified task list, like every
    # other model type. Finished on success or error below.
    from codai.tasks import task_registry
    _title = request.input if isinstance(request.input, str) else "embeddings"
    _tid = task_registry.register(
        "embedding", title=str(_title)[:80], model=(request.model or "embedding"))
    task_registry.start(_tid)
    try:
        _resp = await _run_embeddings(request, http_request)
        task_registry.finish(_tid, "done")
        return _resp
    except HTTPException:
        task_registry.finish(_tid, "error")
        raise
    except Exception as e:
        task_registry.finish(_tid, "error", str(e)[:200])
        raise


async def _run_embeddings(request: EmbeddingsRequest, http_request: Request = None):
    """Core embeddings logic; registered as a task by create_embeddings()."""
    model_info = await asyncio.to_thread(
        multi_model_manager.request_model, request.model, model_type="embedding")
    model_name = model_info.get('model_name')
......
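The register/start/finish wrapping used for embeddings above is a pattern that could be factored into a context manager if other endpoints need it. The sketch below uses a stand-in `InMemoryRegistry` (hypothetical; the real `task_registry` lives in `codai.tasks` and may behave differently) and a hypothetical `tracked` helper. Unlike the diff, it records the error message for every exception type.

```python
import itertools
from contextlib import contextmanager


class InMemoryRegistry:
    """Stand-in for the project's task registry, for illustration only."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.tasks = {}

    def register(self, kind, title="", model=""):
        tid = next(self._ids)
        self.tasks[tid] = {"kind": kind, "title": title, "model": model,
                           "status": "queued", "error": None}
        return tid

    def start(self, tid):
        self.tasks[tid]["status"] = "running"

    def finish(self, tid, status, error=None):
        self.tasks[tid].update(status=status, error=error)


@contextmanager
def tracked(registry, kind, title, model):
    """Register a task, mark it running, and finish it as done or error."""
    tid = registry.register(kind, title=str(title)[:80], model=model)
    registry.start(tid)
    try:
        yield tid
    except Exception as exc:
        registry.finish(tid, "error", str(exc)[:200])
        raise  # the caller still sees the original exception
    else:
        registry.finish(tid, "done")


if __name__ == "__main__":
    reg = InMemoryRegistry()
    with tracked(reg, "embedding", "hello world", "some-embedding-model") as tid:
        pass  # the real work would run here
    print(reg.tasks[tid]["status"])
```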
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
"""Fully-managed Parler-TTS worker.
parler-tts pins an old transformers/tokenizers/huggingface-hub that conflict with
the coderai server's stack, so it can't share this venv. Instead coderai owns the
whole lifecycle here: on first use it bootstraps a dedicated venv (installing
parler-tts), launches ``tools/parler_tts_service.py`` in it as a local HTTP
service, health-checks it, and hands back the URL. The matching
``_RemoteParlerBackend.cleanup()`` calls :func:`stop_service`, so the model
manager's normal eviction tears the process down — no manual setup or config.
"""
import os
import socket
import subprocess
import sys
import threading
import time
from pathlib import Path
_REPO_ROOT = Path(__file__).resolve().parents[2]
_SERVICE_SCRIPT = _REPO_ROOT / "tools" / "parler_tts_service.py"
# Dedicated venv for the (incompatible) parler-tts stack. Fully isolated (no system
# site-packages): parler's pinned transformers/tokenizers must not mix with the
# server's stack, so the venv gets its own copy of everything, torch included.
_VENV_DIR = Path(os.environ.get("CODERAI_PARLER_VENV")
                 or os.path.expanduser("~/.coderai/parler_venv"))
_lock = threading.RLock()
_services: dict[str, dict] = {} # model_name -> {"proc","port","url"}
_bootstrapped = False
def _venv_python() -> Path:
    return _VENV_DIR / ("Scripts" if os.name == "nt" else "bin") / (
        "python.exe" if os.name == "nt" else "python")


def _pip_ok(py: Path) -> bool:
    try:
        return subprocess.run([str(py), "-c", "import parler_tts, soundfile"],
                              capture_output=True).returncode == 0
    except Exception:
        return False


def _venv_is_system_site() -> bool:
    """True if the venv was built with --system-site-packages (can't isolate)."""
    try:
        return "include-system-site-packages = true" in \
            (_VENV_DIR / "pyvenv.cfg").read_text().lower()
    except Exception:
        return False
def _bootstrap_venv() -> Path:
    """Create a fully-isolated venv and install parler-tts (idempotent).
    Isolation is the whole point: parler-tts pins an old transformers/tokenizers
    that must NOT be shared with — or shadowed by — the server's stack, so the
    venv gets its own copy of everything (torch included). Returns its python."""
    global _bootstrapped
    py = _venv_python()
    if _bootstrapped and py.exists():
        return py
    # A previously-created shared-site venv leaks the server's transformers in;
    # rebuild it isolated.
    if py.exists() and _venv_is_system_site():
        import shutil
        print("[parler] rebuilding venv as fully isolated …", flush=True)
        shutil.rmtree(_VENV_DIR, ignore_errors=True)
    if not _venv_python().exists():
        print(f"[parler] creating isolated venv at {_VENV_DIR} …", flush=True)
        _VENV_DIR.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run([sys.executable, "-m", "venv", str(_VENV_DIR)], check=True)
    py = _venv_python()
    if not _pip_ok(py):
        print("[parler] installing parler-tts + torch into the isolated venv "
              "(first run, downloads several GB, this can take a while) …", flush=True)
        subprocess.run([str(py), "-m", "pip", "install",
                        "git+https://github.com/huggingface/parler-tts.git",
                        "soundfile"], check=True)
        if not _pip_ok(py):
            raise RuntimeError("parler-tts install did not yield an importable package")
    _bootstrapped = True
    return py
def _free_port() -> int:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    port = s.getsockname()[1]
    s.close()
    return port


def _pump_logs(proc: subprocess.Popen, tail):
    for line in proc.stdout:
        line = line.rstrip()
        if line:
            tail.append(line)
            print(f"[parler] {line}", flush=True)


def _health_ok(url: str) -> bool:
    import requests
    try:
        r = requests.get(url + "/health", timeout=3)
        return r.ok and bool(r.json().get("ok"))
    except Exception:
        return False
def ensure_service(model_name: str, ready_timeout: float = 1800.0) -> str:
"""Start (or reuse) the worker for ``model_name`` and return its base URL.
First call bootstraps the venv and downloads the model, so the timeout is
generous. Raises RuntimeError if the service never comes up."""
with _lock:
svc = _services.get(model_name)
if svc and svc["proc"].poll() is None and _health_ok(svc["url"]):
return svc["url"]
if svc and svc["proc"].poll() is not None:
_services.pop(model_name, None) # died — restart below
py = _bootstrap_venv()
port = _free_port()
url = f"http://127.0.0.1:{port}"
env = dict(os.environ)
# The worker must use the model already pulled via coderai's HF download
# interface — it never downloads anything itself. Point it at coderai's
# cache and force offline mode, so a missing model fails fast instead of
# silently fetching.
try:
from codai.models.cache import get_hf_hub_cache_dir
hub = get_hf_hub_cache_dir()
env["HF_HUB_CACHE"] = hub
env["HUGGINGFACE_HUB_CACHE"] = hub
except Exception:
pass
env["HF_HUB_OFFLINE"] = "1"
env["TRANSFORMERS_OFFLINE"] = "1"
proc = subprocess.Popen(
[str(py), str(_SERVICE_SCRIPT), "--model", model_name,
"--host", "127.0.0.1", "--port", str(port)],
stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True,
bufsize=1, env=env, cwd=str(_REPO_ROOT),
)
import collections
tail = collections.deque(maxlen=15)
threading.Thread(target=_pump_logs, args=(proc, tail), daemon=True).start()
_services[model_name] = {"proc": proc, "port": port, "url": url}
def _tail_msg():
joined = " | ".join(list(tail)[-5:]).strip()
if "offline" in joined.lower() or ("not" in joined.lower() and "found" in joined.lower()):
return (f". The model isn't in coderai's cache — download "
f"'{model_name}' from the model interface first. ({joined})")
return f". Last output: {joined}" if joined else ""
# Wait (outside the lock) for the service to load the model and answer.
deadline = time.time() + ready_timeout
while time.time() < deadline:
if proc.poll() is not None:
raise RuntimeError(
f"Parler worker exited (code {proc.returncode}) before becoming ready"
+ _tail_msg())
if _health_ok(url):
print(f"[parler] service ready for {model_name} at {url}", flush=True)
return url
time.sleep(2)
stop_service(model_name)
raise RuntimeError(f"Parler worker for {model_name} did not become ready in time"
+ _tail_msg())
def stop_service(model_name: str) -> None:
"""Stop the worker for ``model_name``: terminate(), wait up to 10 s, then kill()."""
with _lock:
svc = _services.pop(model_name, None)
if not svc:
return
proc = svc["proc"]
if proc.poll() is None:
try:
proc.terminate()
proc.wait(timeout=10)
except Exception:
pass
if proc.poll() is None:
try:
proc.kill()
except Exception:
pass
print(f"[parler] service for {model_name} stopped", flush=True)
def stop_all() -> None:
for name in list(_services.keys()):
stop_service(name)
import atexit as _atexit
_atexit.register(stop_all)
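The readiness wait in `ensure_service()` and the terminate-then-kill shutdown in `stop_service()` can be sketched in isolation as below. This is a minimal sketch, not the module's code: `wait_until_ready`, `stop` and `demo` are hypothetical names, the probe is a plain callable instead of an HTTP `/health` check, and the deadline uses `time.monotonic()` where the module uses `time.time()`.

```python
import subprocess
import sys
import time
from typing import Callable


def wait_until_ready(proc: subprocess.Popen, probe: Callable[[], bool],
                     timeout: float = 30.0, interval: float = 0.05) -> None:
    """Block until probe() is true; fail fast if the child exits first."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            raise RuntimeError(
                f"worker exited (code {proc.returncode}) before becoming ready")
        if probe():
            return
        time.sleep(interval)
    raise TimeoutError("worker did not become ready in time")


def stop(proc: subprocess.Popen, grace: float = 10.0) -> None:
    """Ask the child to exit, then force-kill it if it ignores the request."""
    if proc.poll() is None:
        proc.terminate()
        try:
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


def demo() -> tuple:
    # Stand-in worker: a child that only sleeps. The probe turns true on its
    # third call, imitating a service that needs a moment to load.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    calls = {"n": 0}

    def probe() -> bool:
        calls["n"] += 1
        return calls["n"] >= 3

    wait_until_ready(child, probe)
    stop(child)
    return calls["n"], child.poll() is not None
```

Checking `proc.poll()` before each probe is what lets the caller report an early crash with the exit code instead of waiting out the full timeout.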
......@@ -45,6 +45,31 @@ global_args = None
global_file_path = None
def _spatial_task(title: str):
"""Decorator: register a spatial/3D endpoint in the unified task list so
every model type is visible there. Finishes done/error around the call."""
import functools
def deco(fn):
@functools.wraps(fn)
async def wrap(*args, **kwargs):
from codai.tasks import task_registry
tid = task_registry.register("spatial", title=title, model="spatial")
task_registry.start(tid)
try:
result = await fn(*args, **kwargs)
task_registry.finish(tid, "done")
return result
except HTTPException:
task_registry.finish(tid, "error")
raise
except Exception as e:
task_registry.finish(tid, "error", str(e)[:200])
raise
return wrap
return deco
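The decorator pattern used by `_spatial_task` can be tried out without the real registry. The sketch below is an illustration under assumptions: `FakeRegistry` is a hypothetical stand-in for `codai.tasks.task_registry` that only records state transitions, and `tracked` collapses the two `except` branches of the original into one.

```python
import asyncio
import functools


class FakeRegistry:
    """Hypothetical stand-in for task_registry: records transitions only."""

    def __init__(self):
        self.events = []
        self._next = 0

    def register(self, kind, title="", model=""):
        self._next += 1
        self.events.append((self._next, "registered", title))
        return self._next

    def start(self, tid):
        self.events.append((tid, "running", ""))

    def finish(self, tid, status, detail=""):
        self.events.append((tid, status, detail))


registry = FakeRegistry()


def tracked(title):
    """Register the wrapped coroutine as a task and close it as done/error."""
    def deco(fn):
        @functools.wraps(fn)
        async def wrap(*args, **kwargs):
            tid = registry.register("spatial", title=title, model="spatial")
            registry.start(tid)
            try:
                result = await fn(*args, **kwargs)
            except Exception as exc:
                registry.finish(tid, "error", str(exc)[:200])
                raise
            registry.finish(tid, "done")
            return result
        return wrap
    return deco


@tracked("ok job")
async def ok_job():
    return 42


@tracked("bad job")
async def bad_job():
    raise ValueError("boom")


async def main():
    value = await ok_job()
    try:
        await bad_job()
    except ValueError:
        pass
    return value, [status for _, status, _ in registry.events]
```

Because `functools.wraps` copies `__name__` and sets `__wrapped__`, `inspect.signature` still reports the original handler's parameters on the wrapper.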
def set_global_args(args):
global global_args
global_args = args
......@@ -500,6 +525,7 @@ class ImageTo3DRequest(BaseModel):
@router.post("/v1/images/to3d", summary="Image to 3D model")
@_spatial_task("Image → 3D")
async def image_to_3d(request: ImageTo3DRequest, http_request: Request = None):
"""Convert a 2D image to a 3D representation.
......@@ -568,6 +594,7 @@ class ImageFrom3DRequest(BaseModel):
@router.post("/v1/images/from3d", summary="Render a 3D model to an image")
@_spatial_task("3D → image")
async def image_from_3d(request: ImageFrom3DRequest, http_request: Request = None):
"""Render a 3D model (GLB/OBJ) to a 2D PNG image from a specified camera angle."""
raw = _decode_b64(request.model_data)
......@@ -601,6 +628,7 @@ class VideoTo3DRequest(BaseModel):
@router.post("/v1/video/to3d", summary="Video to 3D model")
@_spatial_task("Video → 3D")
async def video_to_3d(request: VideoTo3DRequest, http_request: Request = None):
"""Convert a 2D video to a 3D video frame-by-frame.
......@@ -642,6 +670,7 @@ class VideoFrom3DRequest(BaseModel):
@router.post("/v1/video/from3d", summary="Render a 3D model to a video")
@_spatial_task("3D → video")
async def video_from_3d(request: VideoFrom3DRequest, http_request: Request = None):
"""Render a 3D model as a 360° turntable video."""
raw = _decode_b64(request.model_data)
......@@ -675,6 +704,7 @@ class Generate3DRequest(BaseModel):
@router.post("/v1/3d/generate", summary="Generate a 3D model from a prompt")
@_spatial_task("Generate 3D")
async def generate_3d(request: Generate3DRequest, http_request: Request = None):
"""Generate a 3D model (GLB) from a text prompt and/or an image.
......
This diff is collapsed.

......@@ -135,6 +135,32 @@ async def create_transcription(
if len(file_content) > _MAX_AUDIO_BYTES:
raise HTTPException(status_code=413, detail="Audio file too large (max 100 MB)")
# Register a task so transcription appears in the unified task list, like
# every other model type. Finished on success or error below.
from codai.tasks import task_registry
_tid = task_registry.register(
"transcription",
title=(file.filename or "audio")[:80],
model=model or "",
)
task_registry.start(_tid)
try:
_resp = await _run_transcription(
file_content, model, language, prompt, response_format, temperature, file)
task_registry.finish(_tid, "done")
return _resp
except HTTPException:
task_registry.finish(_tid, "error")
raise
except Exception as e:
task_registry.finish(_tid, "error", str(e)[:200])
raise
async def _run_transcription(
file_content: bytes, model: str, language, prompt, response_format, temperature, file
):
"""Core transcription logic; registered as a task by create_transcription()."""
# Check if the requested model maps to a configured whisper-server instance first.
# Try alias round-robin resolution before direct ID lookup.
whisper_model_id = multi_model_manager.resolve_whisper_alias_model_id(model)
......
......@@ -28,6 +28,7 @@ from pydantic import BaseModel, ConfigDict
# Import from codai modules
from codai.models.manager import multi_model_manager
from codai.api import tts_backends
# Global reference to be set by coderai
......@@ -40,6 +41,20 @@ def set_global_args(args):
global_args = args
# Substrings that mark a model as a text/classifier/embedding model wrongly routed
# to TTS (e.g. an emotion classifier exposed under a stray ``tts:`` alias).
_NON_TTS_HINTS = (
"go_emotions", "roberta", "bert", "embedding", "e5-", "minilm",
"classifier", "toxic", "reranker", "sentence-transformers",
)
def _family_is_text_model(model_name: str) -> bool:
"""Heuristic guard: True when the model is clearly not a speech synthesizer."""
n = (model_name or "").lower()
return any(h in n for h in _NON_TTS_HINTS)
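A quick way to see what the substring guard accepts and rejects is to run it over a few names. The sketch below copies the hint tuple from the hunk; the model names are illustrative only, and `example/hubert-voice` is a made-up name chosen to show that the short `bert` hint also matches any name containing it.

```python
NON_TTS_HINTS = (
    "go_emotions", "roberta", "bert", "embedding", "e5-", "minilm",
    "classifier", "toxic", "reranker", "sentence-transformers",
)


def family_is_text_model(model_name):
    """True when the (lower-cased) name contains any non-TTS hint."""
    n = (model_name or "").lower()
    return any(h in n for h in NON_TTS_HINTS)


# Illustrative names only; the last two are hypothetical.
checks = {
    name: family_is_text_model(name)
    for name in (
        "SamLowe/roberta-base-go_emotions",
        "hexgrad/Kokoro-82M",
        "parler-tts/parler-tts-mini-v1",
        "example/hubert-voice",
        "",
    )
}
```

Since the guard is purely name-based, a speech model whose name happens to contain one of the hints would be rejected; matching on word boundaries or a config field would narrow that, at the cost of a stricter heuristic.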
# =============================================================================
# Router and Endpoints
# =============================================================================
......@@ -72,6 +87,16 @@ async def create_speech(request: TTSRequest, http_request: Request = None):
Supports:
- Kokoro TTS models (when --tts-model is specified)
"""
# Register a task so TTS shows up in the unified task list / dashboard,
# like every other model type. Finished on success or error below.
from codai.tasks import task_registry, loading_task
_tid = task_registry.register(
"tts",
title=(request.input or "")[:80],
model=(request.model or request.voice_profile or "tts"),
)
task_registry.start(_tid)
try:
# If a voice profile is requested, delegate to voice cloning (F5-TTS)
if request.voice_profile:
from codai.api.voice_clone import _load_voice, _f5tts_clone
......@@ -96,6 +121,7 @@ async def create_speech(request: TTSRequest, http_request: Request = None):
except Exception as e:
raise HTTPException(status_code=500, detail=f"Voice cloning failed: {e}")
audio_base64 = base64.b64encode(audio_bytes).decode('utf-8')
task_registry.finish(_tid, "done")
return {"audio": audio_base64}
# Use the manager to resolve the model and manage VRAM
......@@ -111,7 +137,7 @@ async def create_speech(request: TTSRequest, http_request: Request = None):
model_name = model_info['model_name']
model_key = model_info['model_key']
kokoro_model = model_info['model_object']
tts_backend = model_info['model_object']
# If no TTS model configured, return an error
if not model_name:
......@@ -120,35 +146,42 @@ async def create_speech(request: TTSRequest, http_request: Request = None):
detail="TTS not configured. Use --tts-model to specify a model."
)
# Try to use kokoro if available
try:
from kokoro import Kokoro
# Reject text/classifier models that aren't actually speech synthesizers.
if _family_is_text_model(model_name):
raise HTTPException(
status_code=404,
detail=(f"Model '{model_name}' is a text model and cannot be used for "
"TTS generation. Use a TTS model (e.g. a Kokoro/XTTS/Bark model).")
)
if kokoro_model is None:
print(f"Loading Kokoro TTS model: {model_name}")
try:
from codai.api import tts_backends
# Check if model_name is a URL - download it (with caching)
model_path = None
if model_name.startswith('http://') or model_name.startswith('https://'):
print(f"Loading model from URL: {model_name}")
from codai.models.cache import load_model
model_path = load_model(model_name)
if not model_path:
raise Exception(f"Failed to load model from {model_name}")
else:
# Use local path or model name
if tts_backend is None:
print(f"Loading TTS model: {model_name}")
model_path = model_name
# Load the Kokoro model
kokoro_model = Kokoro(model_path if model_path else model_name)
multi_model_manager.add_model(model_key, kokoro_model)
if model_name.startswith(('http://', 'https://')):
from codai.models.cache import load_model
model_path = load_model(model_name) or model_name
cfg = multi_model_manager.config.get(model_key) or \
multi_model_manager.config.get(f"tts:{model_name}") or {}
with loading_task(model_name, model_type="tts"):
tts_backend = await asyncio.to_thread(
tts_backends.load_backend, model_name, model_path, cfg)
multi_model_manager.add_model(model_key, tts_backend)
multi_model_manager.current_model_key = model_key
# Generate speech
voice = request.voice or "af_sarah"
voice = request.voice or getattr(tts_backend, "default_voice", "")
speed = request.speed or 1.0
lang = getattr(request, "language", None) or "en-us"
emotion = getattr(request, "emotion", None) or ""
style = getattr(request, "style", None) or ""
fmt = request.response_format or "wav"
audio_bytes = kokoro_model.generate(request.input, voice=voice, speed=speed)
samples, sample_rate = await asyncio.to_thread(
tts_backend.synthesize, request.input, voice, speed, lang, emotion, style)
audio_bytes, out_fmt = await asyncio.to_thread(
tts_backends.encode_audio, samples, sample_rate, fmt)
try:
from codai.api.archive import archive_manager
......@@ -157,27 +190,29 @@ async def create_speech(request: TTSRequest, http_request: Request = None):
"tts", "/v1/audio/speech",
model_name,
request.input,
{"voice": voice, "speed": speed, "response_format": request.response_format},
[(audio_bytes, request.response_format or "mp3")],
{"voice": voice, "speed": speed, "response_format": out_fmt},
[(audio_bytes, out_fmt)],
))
except Exception:
pass
# Convert to base64
audio_base64 = base64.b64encode(audio_bytes).decode('utf-8')
task_registry.finish(_tid, "done")
return {"audio": audio_base64}
return {
"audio": audio_base64
}
except ImportError as e:
# kokoro not installed
raise HTTPException(
status_code=501,
detail=f"TTS not available. Install kokoro: pip install kokoro. Error: {str(e)}"
)
except HTTPException:
raise
except tts_backends.MissingEngineError as e:
# Missing optional engine (e.g. coqui-tts) → actionable 501.
raise HTTPException(status_code=501, detail=str(e))
except Exception as e:
print(f"TTS error: {e}")
import traceback
traceback.print_exc()
raise HTTPException(status_code=500, detail=f"TTS error: {str(e)}")
except HTTPException:
task_registry.finish(_tid, "error")
raise
except Exception as e:
task_registry.finish(_tid, "error", str(e)[:200])
raise
\ No newline at end of file
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
"""ds4 (DeepSeek V4) proxy backend.
ds4-server already speaks the OpenAI HTTP API, so this backend is a thin proxy: it
forwards chat/completion requests to the managed ``ds4-server`` subprocess (whose
lifecycle is owned by :mod:`codai.api.ds4_worker`) and adapts the responses to the
:class:`~codai.backends.base.ModelBackend` contract the model manager expects.
Tool/think parsing is handled the same way as the other backends — by
``ModelParserAdapter`` over the returned text — so tools are not forwarded to
ds4-server; the text-level ``DeepSeekParser`` extracts ``<think>`` and tool calls.
"""
import asyncio
import threading
from typing import AsyncGenerator, Dict, List, Optional
from codai.backends.base import ModelBackend
class Ds4Backend(ModelBackend):
"""Proxy backend that routes generation to a managed ds4-server."""
def __init__(self, cfg=None):
# cfg is a codai.config.Ds4Config. When omitted, resolve the active one.
if cfg is None:
from codai.config import Ds4Config
cfg = Ds4Config()
self._cfg = cfg
self._model_id = getattr(cfg, "model_id", "deepseek-v4") or "deepseek-v4"
self._url: Optional[str] = None
self._ctx = int(getattr(cfg, "ctx", 100000) or 100000)
self._last_usage: Dict = {}
# ------------------------------------------------------------------ #
# lifecycle
# ------------------------------------------------------------------ #
def load_model(self, model_name: str, **kwargs) -> None:
from codai.api import ds4_worker
if model_name:
self._model_id = model_name
self._url = ds4_worker.ensure_service(self._cfg)
def get_model_name(self) -> str:
return self._model_id
def get_context_size(self) -> int:
return self._ctx
def get_last_usage(self) -> dict:
return dict(self._last_usage)
def cleanup(self) -> None:
from codai.api import ds4_worker
ds4_worker.stop_service(getattr(self._cfg, "model_id", self._model_id))
self._url = None
# ------------------------------------------------------------------ #
# helpers
# ------------------------------------------------------------------ #
def _base(self) -> str:
if not self._url:
raise RuntimeError("ds4 service not started")
return self._url
def _store_usage(self, usage: dict) -> None:
if usage:
self._last_usage = {
"prompt_tokens": usage.get("prompt_tokens", 0),
"completion_tokens": usage.get("completion_tokens", 0),
"total_tokens": usage.get("total_tokens", 0),
}
def format_messages(self, messages) -> str:
# ds4-server applies DeepSeek V4's own chat template server-side; this is only
# used by callers that need a flat prompt string.
parts = []
for m in messages:
role = m.get("role") if isinstance(m, dict) else getattr(m, "role", "")
content = m.get("content") if isinstance(m, dict) else getattr(m, "content", "")
parts.append(f"{role}: {content}")
return "\n".join(parts)
def _chat_payload(self, messages, max_tokens, temperature, top_p, stop, stream):
payload = {
"model": self._model_id,
"messages": messages,
"temperature": temperature,
"top_p": top_p,
"stream": stream,
}
if max_tokens:
payload["max_tokens"] = max_tokens
if stop:
payload["stop"] = stop
return payload
# ------------------------------------------------------------------ #
# chat-level generation (preferred by the manager)
# ------------------------------------------------------------------ #
def generate_chat(self, messages: List[Dict], max_tokens=None, temperature=0.7,
top_p=1.0, stop=None, tools=None, response_format=None):
import requests
payload = self._chat_payload(messages, max_tokens, temperature, top_p, stop, False)
if response_format and response_format.get("type") == "json_object":
payload["response_format"] = {"type": "json_object"}
r = requests.post(self._base() + "/v1/chat/completions", json=payload, timeout=3600)
r.raise_for_status()
data = r.json()
self._store_usage(data.get("usage", {}))
return data["choices"][0]["message"].get("content") or ""
async def generate_chat_stream(self, messages: List[Dict], max_tokens=None,
temperature=0.7, top_p=1.0, stop=None, tools=None,
response_format=None) -> AsyncGenerator[str, None]:
payload = self._chat_payload(messages, max_tokens, temperature, top_p, stop, True)
async for chunk in self._stream(self._base() + "/v1/chat/completions", payload,
delta_key="delta"):
yield chunk
# ------------------------------------------------------------------ #
# plain completion (fallback path)
# ------------------------------------------------------------------ #
def generate(self, prompt: str, max_tokens=None, temperature: float = 0.7,
top_p: float = 1.0, stop=None, repeat_penalty: float = 1.0,
presence_penalty: float = 0.0, frequency_penalty: float = 0.0) -> str:
return self.generate_chat([{"role": "user", "content": prompt}],
max_tokens, temperature, top_p, stop)
async def generate_stream(self, prompt: str, max_tokens=None, temperature: float = 0.7,
top_p: float = 1.0, stop=None, repeat_penalty: float = 1.0,
presence_penalty: float = 0.0,
frequency_penalty: float = 0.0) -> AsyncGenerator[str, None]:
async for chunk in self.generate_chat_stream(
[{"role": "user", "content": prompt}], max_tokens, temperature, top_p, stop):
yield chunk
# ------------------------------------------------------------------ #
# SSE streaming: iterate the blocking requests stream on a worker thread
# and hand chunks to the event loop through an asyncio.Queue.
# ------------------------------------------------------------------ #
async def _stream(self, url: str, payload: dict, delta_key: str
) -> AsyncGenerator[str, None]:
import json
loop = asyncio.get_running_loop()
queue: asyncio.Queue = asyncio.Queue()
_SENTINEL = object()
def _worker():
import requests
try:
with requests.post(url, json=payload, stream=True, timeout=3600) as r:
r.raise_for_status()
for raw in r.iter_lines(decode_unicode=True):
if not raw or not raw.startswith("data:"):
continue
data = raw[len("data:"):].strip()
if data == "[DONE]":
break
try:
obj = json.loads(data)
except ValueError:
continue
choice = (obj.get("choices") or [{}])[0]
text = (choice.get(delta_key) or {}).get("content") or ""
if text:
loop.call_soon_threadsafe(queue.put_nowait, text)
if obj.get("usage"):
self._store_usage(obj["usage"])
if choice.get("finish_reason"):
break
except Exception as exc: # surface to the consumer
loop.call_soon_threadsafe(queue.put_nowait, exc)
finally:
loop.call_soon_threadsafe(queue.put_nowait, _SENTINEL)
threading.Thread(target=_worker, daemon=True).start()
while True:
item = await queue.get()
if item is _SENTINEL:
break
if isinstance(item, Exception):
raise item
yield item
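The thread-to-event-loop hand-off in `_stream()` can be exercised without an HTTP server. The sketch below is a simplified variant under assumptions: `iterate_in_thread`, `fake_sse` and `failing` are hypothetical names, the blocking source is a plain generator instead of `requests.post(..., stream=True)`, and the worker sends exactly one terminal item (either the exception or the sentinel) rather than always sending the sentinel in a `finally`.

```python
import asyncio
import threading
from typing import AsyncGenerator, Callable, Iterable

_SENTINEL = object()


async def iterate_in_thread(
    make_iter: Callable[[], Iterable[str]]
) -> AsyncGenerator[str, None]:
    """Run a blocking iterator on a thread and yield its items on the loop."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def worker() -> None:
        try:
            for item in make_iter():
                # asyncio.Queue is not thread-safe, so every put is scheduled
                # onto the loop instead of being called from this thread.
                loop.call_soon_threadsafe(queue.put_nowait, item)
        except Exception as exc:
            loop.call_soon_threadsafe(queue.put_nowait, exc)
        else:
            loop.call_soon_threadsafe(queue.put_nowait, _SENTINEL)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = await queue.get()
        if item is _SENTINEL:
            return
        if isinstance(item, Exception):
            raise item
        yield item


def fake_sse():
    yield from ["Hel", "lo"]


def failing():
    yield "partial"
    raise ConnectionError("upstream closed")


async def main():
    ok = [chunk async for chunk in iterate_in_thread(fake_sse)]
    seen, error = [], None
    try:
        async for chunk in iterate_in_thread(failing):
            seen.append(chunk)
    except ConnectionError as exc:
        error = str(exc)
    return ok, seen, error
```

Chunks delivered before a failure still reach the consumer, and the error surfaces at the point in the stream where it occurred.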
......@@ -621,6 +621,27 @@ class VulkanBackend(ModelBackend):
else:
raise ValueError(f"Could not cache model from URL: {model_path}")
# Fallback: a configured .gguf path that no longer exists (e.g. the file was
# downloaded into the GGUF cache rather than the HF-hub snapshot the entry
# points at, or a stale snapshot hash). Look for the same filename in the
# GGUF cache dir before giving up — the model loads without re-editing the
# config entry.
if model_path.endswith('.gguf') and not os.path.exists(model_path):
try:
from codai.models.cache import get_model_cache_dir
_base = os.path.basename(model_path)
_cache = get_model_cache_dir()
_cand = os.path.join(_cache, _base)
if not os.path.exists(_cand):
import glob as _glob
_hits = _glob.glob(os.path.join(_cache, "**", _base), recursive=True)
_cand = _hits[0] if _hits else _cand
if os.path.exists(_cand):
print(f" Model path missing; resolved from GGUF cache: {_cand}")
model_path = _cand
except Exception:
pass
if not os.path.exists(model_path):
raise FileNotFoundError(f"Model file not found: {model_path}")
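The cache fallback above amounts to "same filename, different directory". A minimal standalone sketch follows, assuming a hypothetical helper `resolve_in_cache` and a throwaway temp directory in place of `get_model_cache_dir()`. Unlike the hunk, it escapes glob metacharacters and sorts the hits so the choice is deterministic.

```python
import glob
import os
import tempfile


def resolve_in_cache(model_path: str, cache_dir: str) -> str:
    """Return model_path if it exists, else a same-named file under cache_dir."""
    if os.path.exists(model_path):
        return model_path
    base = os.path.basename(model_path)
    direct = os.path.join(cache_dir, base)
    if os.path.exists(direct):
        return direct
    pattern = os.path.join(glob.escape(cache_dir), "**", glob.escape(base))
    hits = sorted(glob.glob(pattern, recursive=True))
    # Hand back the original path when nothing matches, so the caller's
    # existing "file not found" check still fires.
    return hits[0] if hits else model_path


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as cache:
        nested = os.path.join(cache, "some-repo", "model.gguf")
        os.makedirs(os.path.dirname(nested))
        open(nested, "wb").close()
        stale = os.path.join(cache, "old-snapshot", "model.gguf")
        missing = os.path.join(cache, "old-snapshot", "other.gguf")
        return (
            resolve_in_cache(stale, cache) == nested,
            resolve_in_cache(missing, cache) == missing,
        )
```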
......
......@@ -49,7 +49,13 @@ def build_hardware_summary() -> Dict[str, Any]:
total_vram_mb = 0
available_vram_mb = 0
# Only use torch if it's ALREADY loaded (i.e. we're in an engine). Never import
# it here — the front is torch-free and must stay that way (importing torch in
# the front is heavy and would initialise CUDA in the wrong process).
import sys as _sys
try:
if "torch" not in _sys.modules:
raise ImportError("torch not loaded (front) — using torch-free path")
import torch
if torch.cuda.is_available():
......@@ -76,6 +82,23 @@ def build_hardware_summary() -> Dict[str, Any]:
except Exception:
pass
# Torch-free path (e.g. the front, which imports no torch): enumerate every
# physical card via nvidia-smi + sysfs so VRAM is reported for the whole node.
if not gpus:
try:
from codai.frontproxy.gpu_detect import gpu_stats
for c in gpu_stats():
total_mb = int(round((c.get("mem_total") or 0) * 1024))
used_mb = int(round((c.get("mem_used") or 0) * 1024))
if total_mb <= 0:
continue
gpus.append({"name": c.get("name") or c.get("vendor"),
"total_vram_mb": total_mb})
total_vram_mb += total_mb
available_vram_mb += max(0, total_mb - used_mb)
except Exception:
pass
if not gpus:
for total_path in sorted(glob.glob("/sys/class/drm/card*/device/mem_info_vram_total")):
used_path = total_path.replace("vram_total", "vram_used")
......
......@@ -60,8 +60,13 @@ def _is_text_response(content_type: str | None) -> bool:
)
async def execute_broker_request(app, envelope):
"""Validate and execute a broker request envelope."""
async def execute_broker_request(app, envelope, executor=None):
"""Validate and execute a broker request envelope.
``executor`` is an ``async (method, path, headers, query, body) -> {status_code,
headers, body}`` callable. When omitted the request is run in-process against
``app`` via the ASGI bridge (engine / single-process mode). The front passes its
own executor that proxies to the right engine over HTTP."""
logger.debug(
"broker dispatch → op=%s request_id=%s path=%r method=%r stream=%s",
......@@ -136,6 +141,12 @@ async def execute_broker_request(app, envelope):
headers["content-type"] = envelope.content_type
started_at = perf_counter()
if executor is not None:
response = await executor(
method=envelope.method, path=envelope.path, headers=headers,
query=envelope.query, body=body,
)
else:
response = await execute_internal_request(
app,
method=envelope.method,
......
......@@ -224,6 +224,13 @@ configuration directory (--config DIR, default: OS-specific CoderAI directory).
action="store_true",
help="Dump model output: raw output, parsed output, and litellm debug info",
)
parser.add_argument(
"--debug-requests",
action="store_true",
help="Log the full request/response payloads exchanged with API clients "
"(opencode, etc.): incoming messages + tools and the outgoing "
"content/tool_calls. Use to diagnose agentic tool-call loops.",
)
parser.add_argument(
"--list-cached-models",
action="store_true",
......@@ -278,4 +285,39 @@ configuration directory (--config DIR, default: OS-specific CoderAI directory).
help="Ignore any existing pipeline cache and rebuild it from scratch this "
"run (use after changing a model's quantization/precision config).",
)
# ─── Frontend/engine split ───────────────────────────────────────────────
parser.add_argument(
"--single-process",
action="store_true",
help="Run the legacy single-process server (UI/API and all model work in "
"one process). Default boots a front proxy + supervised engine "
"subprocess(es) so the web UI stays responsive during model work.",
)
parser.add_argument(
"--engine-only",
action="store_true",
help="Run this process as an engine (binds an internal localhost port, no "
"front proxy). Normally launched automatically by the front; not "
"intended to be run by hand.",
)
parser.add_argument(
"--internal-port",
type=int,
default=None,
help="Internal port for --engine-only mode (the front assigns one per engine).",
)
parser.add_argument(
"--debug-engine",
action="store_true",
help="General engine debugging in the front/engine split (engine lifecycle, "
"spawn details, health transitions). Does NOT include the internal "
"HTTP access log — use --debug-engine-web for that.",
)
parser.add_argument(
"--debug-engine-web",
action="store_true",
help="Show the internal front↔engine HTTP requests in an engine's access log "
"(proxied calls, /internal/engine-state, /healthz, …). Suppressed by "
"default since every engine only ever serves internal front traffic.",
)
return parser.parse_args()
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
"""Front proxy package: always-responsive web/API front + supervised engines.
See ``docs/frontend-engine-split.md`` and ``docs/process-isolation-plans.md``.
"""
from codai.frontproxy.app import run_front, build_app
__all__ = ["run_front", "build_app"]
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
"""Assign each configured model to exactly one engine.
With multiple engines, every engine would otherwise read the shared models.json and
register *every* model — so a model would appear on several engines at once. The
front instead computes a single **owner** per model and tells each engine which
models it owns; the engine then registers only those.
Owner precedence (per model):
1. The per-model ``engine`` pin (models.json), if that engine can run the model.
2. The configured default engine, if it can run the model.
3. Round-robin across the capability-compatible engines (balanced, deterministic),
so unpinned models spread out instead of all landing on one engine.
A model whose format no engine can serve is left unassigned (it can't run anyway).
"""
import json
# models.json categories that hold servable model entries.
_CATEGORIES = (
"text_models", "gguf_models", "vision_models", "image_models",
"audio_models", "tts_models", "video_models", "audio_gen_models",
"embedding_models", "spatial_models",
)
def _entry_path(entry):
"""The model's path/id — used for capability detection (e.g. is it a .gguf)."""
if isinstance(entry, str):
return entry
if isinstance(entry, dict):
return entry.get("path") or entry.get("id")
return None
def _route_key(entry):
"""The identifier clients address this entry by (alias > path > id).
Keying on the alias lets two *configs* of the same model — with distinct
aliases — be assigned to different engines; configs sharing a path with no
distinct alias collapse to one owner (they're not separately addressable)."""
if isinstance(entry, str):
return entry
if isinstance(entry, dict):
return entry.get("alias") or entry.get("path") or entry.get("id")
return None
def _required_cap(entry, ds4_cfg):
from codai.frontproxy.router import required_capability
path = _entry_path(entry) or ""
backend = entry.get("backend") if isinstance(entry, dict) else None
return required_capability(
path, backend=backend,
ds4_model_id=getattr(ds4_cfg, "model_id", None) if ds4_cfg else None,
ds4_enabled=bool(getattr(ds4_cfg, "enabled", False)) if ds4_cfg else False)
def compute_assignment(engines, models_path, default_engine=None, ds4_cfg=None):
"""Return {engine_name: [model_identifiers]} — each model owned by one engine."""
assignment = {e.name: [] for e in engines}
if not engines or not models_path:
return assignment
try:
with open(models_path) as f:
data = json.load(f)
except Exception:
return assignment
default_engine = (default_engine or "").strip().lower()
rr = {} # round-robin cursor per candidate-set signature
seen = set()
for cat in _CATEGORIES:
for entry in data.get(cat, []):
ident = _route_key(entry)
if not ident or ident in seen:
continue
cap = _required_cap(entry, ds4_cfg)
candidates = [e for e in engines if e.can_serve(cap)]
if not candidates:
continue # nothing can run it — leave unassigned
owner = None
pin = ((entry.get("engine") if isinstance(entry, dict) else "") or "").strip().lower()
if pin:
owner = next((e for e in candidates
if e.name.lower() == pin or (e.backend or "").lower() == pin), None)
if owner is None and default_engine:
owner = next((e for e in candidates
if e.name.lower() == default_engine
or (e.backend or "").lower() == default_engine), None)
if owner is None:
key = tuple(sorted(e.name for e in candidates))
i = rr.get(key, 0)
owner = candidates[i % len(candidates)]
rr[key] = i + 1
assignment[owner.name].append(ident)
seen.add(ident)
return assignment
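The owner precedence (pin, then default engine, then round-robin) is easier to follow on a tiny example. The sketch below is a simplification under assumptions: `Eng`, `pick_owner` and `assign` are hypothetical names, engines are matched by name only (the module also matches on backend), and models are given as `(identifier, required capability, pin)` tuples instead of being read from models.json.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Eng:
    name: str
    caps: Set[str] = field(default_factory=set)


def pick_owner(candidates: List[Eng], pin: str, default: str,
               rr: Dict[tuple, int]) -> Eng:
    by_name = {e.name: e for e in candidates}
    if pin in by_name:            # 1. per-model pin, if that engine can run it
        return by_name[pin]
    if default in by_name:        # 2. configured default engine
        return by_name[default]
    key = tuple(sorted(by_name))  # 3. round-robin, one cursor per candidate set
    i = rr.get(key, 0)
    rr[key] = i + 1
    return candidates[i % len(candidates)]


def assign(engines: List[Eng], models: List[Tuple[str, str, str]],
           default: str = "") -> Dict[str, List[str]]:
    out = {e.name: [] for e in engines}
    rr: Dict[tuple, int] = {}
    for ident, cap, pin in models:
        candidates = [e for e in engines if cap in e.caps]
        if not candidates:
            continue  # no engine can serve this format; leave it unassigned
        out[pick_owner(candidates, pin, default, rr).name].append(ident)
    return out


engines = [Eng("nv", {"transformers", "gguf"}), Eng("amd", {"gguf"})]
models = [
    ("llama.gguf", "gguf", ""),
    ("qwen.gguf", "gguf", ""),
    ("phi", "transformers", "amd"),   # pin ignored: amd cannot run transformers
    ("mistral.gguf", "gguf", "amd"),  # pin honoured
    ("exotic", "ds4", ""),            # nobody serves ds4 in this toy setup
]
result = assign(engines, models)
```

Keeping one cursor per candidate set means models that only one engine can run do not advance the rotation for models that several engines share.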
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
"""Front-side registry of engine subprocesses.
The front never imports torch; it knows about engines only through the small,
auth-free ``/internal/engine-state`` endpoint each engine exposes on localhost.
This module holds the shared, thread-safe view the supervisor writes and the
router/aggregator read.
"""
import threading
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set
# Default model-format capabilities implied by an engine's backend:
# transformers — safetensors/HF models (CUDA only here)
# gguf — llama.cpp models (CUDA or Vulkan)
# whisper — whisper.cpp STT (CUDA or Vulkan)
# ds4 — DeepSeek V4 via the native ds4 engine (CUDA-only build)
# An NVIDIA engine can do all of them; a Vulkan (e.g. Radeon) engine does GGUF and
# whisper, but not transformers and not ds4.
_DEFAULT_CAPS = {
"nvidia": {"transformers", "gguf", "whisper", "ds4"},
"cuda": {"transformers", "gguf", "whisper", "ds4"},
"vulkan": {"gguf", "whisper"},
"opencl": {"gguf", "whisper"},
"auto": {"transformers", "gguf", "whisper", "ds4"},
}

@dataclass
class Engine:
    id: int
    gpu: Optional[int]  # device hint for logs (CUDA/Vulkan index; None = n/a)
    port: int
    primary: bool = False  # the engine that owns admin/auth/config traffic
    name: str = ""  # human label for logs
    backend: str = "auto"  # nvidia | vulkan | … (forced for this engine)
    env: dict = field(default_factory=dict)  # extra env applied at spawn
    capabilities: Set[str] = field(default_factory=set)  # model formats it can serve
    assigned_models: Set[str] = field(default_factory=set)  # routable ids it owns
    url: str = ""
    healthy: bool = False
    loaded_models: Set[str] = field(default_factory=set)
    vram: Optional[dict] = None
    tasks: list = field(default_factory=list)  # running/queued tasks on this engine
    cooling: Optional[dict] = None  # thermal cooldown state, or None when not cooling
    last_ok: float = 0.0  # monotonic time of last successful poll
    proc: object = None  # subprocess.Popen (set by the supervisor)

    def __post_init__(self):
        if not self.url:
            self.url = f"http://127.0.0.1:{self.port}"
        if not self.name:
            self.name = f"engine#{self.id}"
        if not self.capabilities:
            self.capabilities = set(_DEFAULT_CAPS.get(self.backend, {"transformers", "gguf"}))

    def can_serve(self, required_cap: Optional[str]) -> bool:
        return (not required_cap) or (required_cap in self.capabilities)


class EngineRegistry:
    def __init__(self):
        self._engines: Dict[int, Engine] = {}
        self._lock = threading.RLock()

    def add(self, engine: Engine) -> None:
        with self._lock:
            self._engines[engine.id] = engine

    def get(self, engine_id: int) -> Optional[Engine]:
        with self._lock:
            return self._engines.get(engine_id)

    def all(self) -> List[Engine]:
        with self._lock:
            return list(self._engines.values())

    def healthy(self) -> List[Engine]:
        with self._lock:
            return [e for e in self._engines.values() if e.healthy]

    def primary(self) -> Optional[Engine]:
        """The engine that owns admin/session/config — falls back to first healthy."""
        with self._lock:
            prim = next((e for e in self._engines.values() if e.primary), None)
            if prim and prim.healthy:
                return prim
            return next((e for e in self._engines.values() if e.healthy), prim)

    def by_name(self, name: Optional[str]) -> Optional[Engine]:
        """Resolve an engine by its declared name (or, failing that, its backend).

        Used for the configured default engine and per-model pins. Prefers a healthy
        match but returns an unhealthy one too, so callers can decide."""
        if not name:
            return None
        name = name.strip().lower()
        with self._lock:
            engines = list(self._engines.values())
        match = None
        for e in engines:
            if (e.name or "").lower() == name or (e.backend or "").lower() == name:
                if e.healthy:
                    return e
                match = match or e
        return match

    def update_state(self, engine_id: int, *, healthy: bool,
                     loaded_models=None, vram=None, tasks=None,
                     cooling=False) -> None:
        with self._lock:
            e = self._engines.get(engine_id)
            if not e:
                return
            e.healthy = healthy
            if healthy:
                e.last_ok = time.monotonic()
            if loaded_models is not None:
                e.loaded_models = set(loaded_models)
            if vram is not None:
                e.vram = vram
            if tasks is not None:
                e.tasks = list(tasks)
            elif not healthy:
                e.tasks = []
            if cooling is not False:  # explicit None clears it
                e.cooling = cooling
            elif not healthy:
                e.cooling = None

    def engine_for_model(self, model_key: str, required_cap: Optional[str] = None) -> Optional[Engine]:
        """Return a healthy, capability-compatible engine that already has the model
        resident, if any.

        Matching is forgiving: exact key, short-name, or type-prefixed variants —
        the same fuzzy spirit the manager uses, but read-only over loaded keys."""
        if not model_key:
            return None
        short = model_key.split("/")[-1]
        with self._lock:
            for e in self._engines.values():
                if not e.healthy or not e.can_serve(required_cap):
                    continue
                for k in e.loaded_models:
                    if k == model_key or k.split("/")[-1] == short \
                            or k.endswith(model_key) or model_key.endswith(k.split(":")[-1]):
                        return e
        return None

    def engine_for_assigned(self, model_key: str) -> Optional[Engine]:
        """The engine the front ASSIGNED this model to (single owner), or None.

        The assignment is the authoritative routing decision (it already encodes
        pins, the default engine, and balanced auto-selection); match leniently so a
        short-name / alias resolves to the owner."""
        if not model_key:
            return None
        short = model_key.split("/")[-1]
        with self._lock:
            for e in self._engines.values():
                if not e.healthy:
                    continue
                for k in e.assigned_models:
                    if (k == model_key or k.split("/")[-1] == short
                            or k.endswith(model_key) or model_key.endswith(k.split("/")[-1])):
                        return e
        return None

    def least_loaded(self, required_cap: Optional[str] = None) -> Optional[Engine]:
        """Pick a healthy, capability-compatible engine to load a new model on:
        fewest resident models, then most free VRAM."""
        with self._lock:
            cands = [e for e in self._engines.values()
                     if e.healthy and e.can_serve(required_cap)]
        if not cands:
            return None

        def _free(e: Engine) -> float:
            return (e.vram or {}).get("free", 0.0) if e.vram else 0.0

        cands.sort(key=lambda e: (len(e.loaded_models), -_free(e)))
        return cands[0]
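The lenient key match and the least-loaded ordering used by the registry can be tried in isolation. The following is a minimal standalone sketch, not the module itself: `key_matches` and `pick_least_loaded` are hypothetical helper names, engines are reduced to plain `(name, loaded_models, free_vram)` tuples, and the predicate mirrors the one in `engine_for_assigned` (the `engine_for_model` variant splits the resident key on `":"` in its last clause instead).

```python
from typing import Iterable, Optional, Set, Tuple

EngineRow = Tuple[str, Set[str], float]  # (name, loaded_models, free_vram), illustrative only


def key_matches(resident_key: str, model_key: str) -> bool:
    """Lenient match: exact key, same short name, or suffix in either direction."""
    short = model_key.split("/")[-1]
    return (resident_key == model_key
            or resident_key.split("/")[-1] == short
            or resident_key.endswith(model_key)
            or model_key.endswith(resident_key.split("/")[-1]))


def pick_least_loaded(engines: Iterable[EngineRow]) -> Optional[str]:
    """Fewest resident models first, then most free VRAM (hence the negated value)."""
    cands = sorted(engines, key=lambda e: (len(e[1]), -e[2]))
    return cands[0][0] if cands else None


rows = [("a", {"m1", "m2"}, 10.0), ("b", {"m1"}, 2.0), ("c", {"m3"}, 8.0)]
print(pick_least_loaded(rows))
print(key_matches("org/llama-3-8b", "llama-3-8b"))
```

One consequence of the short-name clause is that two different organisations' models with the same final path component match each other, which is worth keeping in mind when assigning models.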
......@@ -21,6 +21,7 @@ from threading import Lock
from typing import List, Optional
import json
import os
import re
import time
......@@ -179,11 +180,15 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities:
         return caps

     # ── Image: upscaling (checked before general SD rule to catch SD-family upscalers) ──
-    if any(x in n for x in ['real-esrgan', 'esrgan', 'swinir', 'edsr',
-                            'bsrgan', 'hat-', 'dat-',
+    # 'hat-'/'dat-' are short, ambiguous tokens (e.g. they appear inside
+    # "chat-", "update-"); require a word boundary before them so a text "chat"
+    # model isn't mistaken for the HAT/DAT super-resolution checkpoints.
+    if (any(x in n for x in ['real-esrgan', 'esrgan', 'swinir', 'edsr',
+                             'bsrgan',
                              'x2-upscaler', 'x4-upscaler', 'x2_upscaler', 'x4_upscaler',
                              'latent-upscaler', 'latent_upscaler',
-                            'ldm-super-resolution', 'rcan-', 'sr3-']):
+                             'ldm-super-resolution', 'rcan-', 'sr3-'])
+            or re.search(r'\b[hd]at-', n)):
         caps.image_upscaling = True
         caps.image_to_image = True
         return caps
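To see what the word-boundary check changes, the pattern from this hunk can be run on its own. This is a small sketch: the helper name `looks_like_hat_or_dat` and the sample model names are made up for illustration, and it assumes `n` in the original function is the lowercased model name.

```python
import re

# Pattern copied from the hunk; everything else here is illustrative.
_HAT_DAT = re.compile(r'\b[hd]at-')


def looks_like_hat_or_dat(model_name: str) -> bool:
    """True when 'hat-' or 'dat-' begins at a word boundary in the lowercased name."""
    return _HAT_DAT.search(model_name.lower()) is not None


for name in ["org/HAT-L_SRx4", "dat-x4", "mistral-chat-7b", "my_hat-model"]:
    print(name, looks_like_hat_or_dat(name))
```

The first two names match because `/` and the start of the string are word boundaries; `mistral-chat-7b` no longer matches because `c` and `h` are both word characters. Since `_` also counts as a word character in Python regexes, a name such as `my_hat-model` is not matched either.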
......
python tools/video_editor.py --no-browser --host 0.0.0.0 --media-dir tools/coderai_media --session
tools/gen_township_fighters.py -c township_output/township_config.json
# Expressive TTS (emotion / delivery)
The video editor shows **Emotion** and **Delivery** dropdowns whenever the
configured TTS model advertises them (`codai/api/tts_backends.py`:
`family_emotions` / `family_styles`). Two engines support expressive control.
## Bark — in-stack, no extra deps
Works with the server's current `transformers`. Configure a Bark model as the
TTS model, e.g. `--tts-model suno/bark` (or `suno/bark-small`).
- **Delivery**: `normal`, `whispering` (`[whispers] …`), `singing` (`♪ … ♪`),
`emphasis` (UPPERCASE).
- **Emotion**: inserts a matching non-verbal cue — `laughter` → `[laughs]`,
`sigh` → `[sighs]`, `gasp` → `[gasps]`.
- **Voice**: a Bark preset like `v2/en_speaker_6`. The editor's Kokoro voice ids
don't apply and fall back to the default preset (set `voice_preset` in the
model config to change it). Speed isn't controllable in Bark.
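The Delivery and Emotion options above amount to simple text decoration before the prompt reaches Bark. A minimal sketch of that idea follows; `decorate_for_bark` is a hypothetical helper, not the function in `codai/api/tts_backends.py`, and placing the emotion cue in front of the text is an assumption.

```python
from typing import Optional

# Cue strings taken from the list above; the mapping name is illustrative.
_EMOTION_CUES = {"laughter": "[laughs]", "sigh": "[sighs]", "gasp": "[gasps]"}


def decorate_for_bark(text: str, delivery: str = "normal",
                      emotion: Optional[str] = None) -> str:
    """Apply a delivery style, then prefix a non-verbal cue if the emotion has one."""
    if delivery == "whispering":
        text = f"[whispers] {text}"
    elif delivery == "singing":
        text = f"♪ {text} ♪"
    elif delivery == "emphasis":
        text = text.upper()
    cue = _EMOTION_CUES.get(emotion or "")
    if cue:
        text = f"{cue} {text}"  # assumption: the cue goes before the text
    return text


print(decorate_for_bark("nice to meet you", delivery="emphasis", emotion="laughter"))
```

Unknown delivery or emotion values fall through unchanged in this sketch, so the text is still synthesised with the default delivery.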
## Parler — fully managed by coderai (no setup)
`parler-tts` pins an old `transformers`/`tokenizers`/`huggingface-hub` that
**conflict with this server** — never `pip install` it into the coderai venv.
coderai handles this for you: just use a Parler model as the TTS model
(e.g. `parler-tts/parler-tts-mini-multilingual`). The worker is launched lazily —
only when a request for that model actually arrives — and shut down when the
model is evicted, exactly like loading/unloading any other model. On first use it
1. creates a dedicated venv at `~/.coderai/parler_venv`
(override with `CODERAI_PARLER_VENV`), built `--system-site-packages` so the
base torch/numpy are reused and only the conflicting packages land in it;
2. `pip install`s parler-tts there;
3. launches `tools/parler_tts_service.py` in that venv on a local port, pointing
`HF_HUB_CACHE` at coderai's own cache and forcing **offline mode**
(`HF_HUB_OFFLINE=1`) so it loads strictly the model you **already downloaded
via the model interface** — the worker never downloads anything itself;
4. health-checks it and routes synthesis to it.
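The environment handling in steps 1 and 3 can be sketched as follows. This is an illustration under stated assumptions, not the code in `codai/api/parler_worker.py`: the three helper names are hypothetical, and only the paths, environment variables, and the `--system-site-packages` flag come from the description above.

```python
import os
import sys
from pathlib import Path
from typing import List, Mapping, Optional


def parler_venv_dir(environ: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the worker venv: CODERAI_PARLER_VENV wins, else ~/.coderai/parler_venv."""
    environ = os.environ if environ is None else environ
    override = environ.get("CODERAI_PARLER_VENV")
    return Path(override) if override else Path.home() / ".coderai" / "parler_venv"


def venv_create_cmd(venv: Path) -> List[str]:
    """Command that builds a venv which reuses the base interpreter's packages."""
    return [sys.executable, "-m", "venv", "--system-site-packages", str(venv)]


def worker_env(hf_cache: str, base: Optional[Mapping[str, str]] = None) -> dict:
    """Environment for the worker: shared HF cache, hub access forced offline."""
    env = dict(os.environ if base is None else base)
    env["HF_HUB_CACHE"] = hf_cache
    env["HF_HUB_OFFLINE"] = "1"
    return env


print(venv_create_cmd(parler_venv_dir({"CODERAI_PARLER_VENV": "/tmp/parler_venv"})))
```

Keeping the offline flag in the worker's own environment, rather than in the server's, means only the Parler process is barred from downloading.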
The worker is owned by `codai/api/parler_worker.py`; the backend's `cleanup()`
calls `stop_service()`, so the model manager's normal eviction tears the process
down. The first request blocks while the venv builds, then it's cached.
If the model isn't in coderai's cache, the worker fails fast with a clear error
("download '<model>' from the model interface first") instead of fetching it.
Download the Parler model through the normal HF download UI first.
The editor's **Emotion**/**Delivery** dropdowns drive it: coderai POSTs
`{text, voice, speed, emotion, style}` to the worker, which maps them into a
natural-language delivery description (whisper / shout / monotone / expressive +
emotion + pace). A fixed `description` in the model config overrides the
auto-built one. An explicit `service_url` in the config bypasses management and
talks to an externally-run service instead.
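As a rough illustration of that mapping, the sketch below turns `emotion`, `style`, and `speed` into a description string and lets a fixed description take precedence. The function name, the phrase table, and the speed thresholds are all assumptions made for the example; the worker's actual wording may differ.

```python
from typing import Optional

# Illustrative phrases only; the style keys follow the list in the paragraph above.
_STYLE_PHRASES = {
    "whisper": "whispers quietly",
    "shout": "shouts loudly",
    "monotone": "speaks in a flat, monotone voice",
    "expressive": "speaks in a lively, expressive voice",
}


def build_description(emotion: Optional[str] = None, style: Optional[str] = None,
                      speed: float = 1.0, fixed: Optional[str] = None) -> str:
    """Compose a delivery description; a fixed description overrides everything."""
    if fixed:
        return fixed
    parts = ["A speaker " + _STYLE_PHRASES.get(style or "", "speaks naturally")]
    if emotion:
        parts.append(f"with a {emotion} tone")
    if speed > 1.1:  # thresholds are assumptions
        parts.append("at a fast pace")
    elif speed < 0.9:
        parts.append("at a slow pace")
    else:
        parts.append("at a moderate pace")
    return " ".join(parts) + "."


print(build_description(emotion="sad", style="whisper", speed=0.8))
```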
> The model must still be in the server's allowed-models registry to be
> selectable — that's the only configuration; the worker itself needs none.