Commits · f0dcf7eb2a2687135132f13eb2662aea05585ebf · nexlab / coderai

19 Jun, 2026 27 commits

admin: strict 1:1 whisper model<->runner linked by config alias == runner id · f0dcf7eb

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

The whisper-server runner's model_id is inherited from the gguf MODEL config's
alias, which links the two. So:
- Adding a model config creates one runner whose id is the config's alias
  (auto-minted + stamped onto the config when no alias is given).
- Removing a config (by config_id or by path) tears down the runner whose id
  matches that config's alias — one config removed = one runner removed/killed.

Replaces the interim model_config_id link.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

f0dcf7eb

whisper: account a running runner as a loaded model for VRAM eviction · 2a214215

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

Starting a whisper-server runner loads the gguf onto the GPU, but it was
invisible to the VRAM-eviction logic — it never evicted others to make room,
recorded no footprint, and (lacking a cleanup()) couldn't itself be evicted.

- WhisperServerManager.cleanup() -> stop(), so _evict_one/unload_model can
  free its VRAM like any other model.
- MultiModelManager.start_whisper_server(): estimate the gguf footprint, evict
  other models if free VRAM is short, start the subprocess, and register it in
  models/models_in_vram/_measured_vram_gb (active_in_vram). It's now both a
  trigger for eviction and an eviction candidate.
- stop_whisper_server(): stop + clear all that accounting (frees VRAM).
- Routed every start/stop through these: on-request transcription, engine
  startup pre-load, admin model-load (Load button) and model-unload/disable.

So: starting a runner = a model load (evicts as needed); unloading = frees VRAM.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2a214215

admin: whisper gguf model auto-manages its runner (1:1) · 3d551444

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

Model a whisper gguf as two things: a MODEL config (a .gguf entry with
backend=whisper-server and NO model_path — enables the model, holds load
strategy, shown on the GGUF row) and a RUNNER (backend=whisper-server WITH
model_path — the subprocess, shown in the whisper card).

- Enabling a .gguf with speech_to_text marks it backend=whisper-server and
  auto-creates exactly one runner (1:1) on a free port.
- Disabling the model removes + kills all its runners (cascade by model_path).
- Removing a runner (or model) now stops the subprocess + drops registry
  entries, instead of leaving it running until restart.
- cached-models shows the model config on the GGUF row but excludes runners;
  the whisper card shows only runners (require model_path).
- engine startup only launches runners (entries with model_path), never the
  bare model config.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

3d551444

admin: keep GGUF model config and whisper-server runner config separate · 41e3661e

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

A GGUF model config (configure + enable the model, load strategy) and a
whisper-server config (the runner: port, gpu, which model) are two distinct
things. Showing whisper-servers as the backing file's configs made the GGUF
row's "Configure" open the whisper form — conflating them.

Whisper-server entries are again excluded from a GGUF file's editable config
list (they live only in their own card); the GGUF row's Configure opens the
general model config modal. The file still reflects "loaded" via its
model_path in the loaded-status sets.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

41e3661e

admin: show whisper-servers as gguf configs again, edit via whisper editor · 615967d8

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

Reverted the over-correction that hid whisper-server entries from the
backing GGUF file's config list — they should appear there (the file is
configured through them). To avoid the duplicate-config bug, editing a
whisper-server config from the GGUF row now routes to the whisper editor
(which updates in place by id) instead of the general config modal. Pills
for whisper-server configs are labelled by id so the two instances are
distinguishable (they share the "whisper" alias).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

615967d8

admin: fix duplicate gguf configs from whisper-server pollution · ea8dd92c

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

Editing a GGUF model's config kept appending duplicate entries. Root cause:
api_cached_models added whisper-server entries to the backing GGUF file's
config list (keyed by model_path). With whisper0/whisper1 both pointing at
the file, the GGUF row's configs[0] became a whisper-server entry, which
carries no config_id — so "Configure" on that row treated every save as a
brand-new config and spawned a fresh duplicate each time.

- cached-models: skip whisper-server entries entirely (they're managed in
  their own card; the file still shows "loaded" via its model_path key).
- model-configure (whisper-server): update an existing entry in place when
  the id matches instead of 409-or-append, preserving unmanaged fields
  (engine, config_id).
- model-disable: guard against whisper-server entries' path=None so a
  path-based disable can't crash on basename(None).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ea8dd92c

admin: stop the models page jumping to the bottom on refresh · d7636907

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

Clicking Load/Unload (or any refreshLocal) re-ran loadCachedModels, which
blanked the HF/GGUF lists to "Loading…" every time. That collapsed the page
height and threw the viewport to the end, so a whisper-server unload looked
like it "did nothing and scrolled to the bottom" even though it worked.

Now the "Loading…" placeholder only shows on the first (empty) load; on a
refresh the existing rows stay in place and the scroll position is captured
and restored around the re-render.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

d7636907

admin: whisper-server load mode derives from the backing GGUF config · b717f1dc

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

Removed the redundant "Load mode" dropdown from both whisper-server forms
(builder + edit modal). A whisper-server is backed by a GGUF model, so its
load mode now derives from that GGUF's configured load_mode (default
on-request) instead of being set separately in two places. Load/offload
strategy stays solely in the GGUF model config.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

b717f1dc

admin: whisper-server unload clears registry + gguf file shows loaded · 2b8e69df

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

Two whisper-server UI issues on a multi-engine node:

- Unload didn't visibly take effect: a whisper-server is registered both as
  a subprocess (whisper_servers) AND in the generic .models/.model_pools
  registry under audio:<id>. The unload stopped the subprocess but left the
  registry entry, so the row kept showing "Unload". Now it also drops the
  audio:<id>/<id> registry entries (and matches by id, audio:<id>, or the
  gguf model_path, so unloading the file stops every server using it).

- The backing gguf file showed "Load" while its whisper-server was running.
  Surface each running server's _model_path in the loaded-key sets
  (engine-state, model-loaded-status, status) so the GGUF-file row reflects
  that the file is in use.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2b8e69df

admin: show models pinned to a secondary engine (whisper form + loaded) · 25cb3b80

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

/admin/api/models is served by the primary engine and ran list_models()
with the per-engine assignment filter, so models pinned to a secondary
engine (whisper-servers configured with engine=radeon) were dropped. That
emptied the whisper-server config form and left those models unmatched in
the loaded-status check, so they never showed as loaded.

Add list_models(all_engines=True) to bypass _entry_assigned, and call it
from the admin model list. Combined with the whisper-servers now reported
in /internal/engine-state, the radeon whisper-servers reappear in their own
form and show as loaded.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

25cb3b80

front: include running whisper-servers in /internal/engine-state · 801d21a9

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

The supervisor polls /internal/engine-state (not /admin/api/status) to
populate each engine's loaded_models in the registry, which the front then
aggregates for the models page. That endpoint only read .models, so a
whisper-server running on a secondary engine still showed as not loaded.
Fold each running server (id + `audio:` alias) into its loaded_models.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

801d21a9

admin: actually free VRAM on unload + show whisper-server as loaded · 84def90a

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

Two issues when unloading/reporting models on a multi-engine node:

- Unload didn't free VRAM for pooled models. api_model_unload only popped
  multi_model_manager.models and never touched model_pools, so a model
  served with max_instances>1 (which lives only in the pool) kept all its
  instances resident. Now it searches both dicts and calls unload_model(),
  which cleans up the whole pool + runs gc/empty_cache. Also handles
  whisper-server models (their own subprocess) by stopping the server.

- whisper-server showed as "not loaded". It runs as a subprocess tracked
  in whisper_servers, not in .models. Fold each running server (id +
  `audio:` alias) into both the model-loaded-status list and the
  /admin/api/status loaded_keys, so the models page, dashboard count and
  per-engine box all reflect it (incl. on a secondary engine).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

84def90a

front: route admin model load/unload to the owning engine · 8abd66c7

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

model-load/model-unload were proxied to the primary engine, so unloading
(or loading) a model that lives on a secondary engine hit the wrong process
and silently no-op'd (was_loaded=False). Add front-proxy interceptors:

- unload: find the engine whose loaded_models matches the path and forward
  the request there; fall back to the primary.
- load: reuse an engine already serving the model, else the model's engine
  pin from models.json, else the primary.

Registered before the catch-all proxy, mirroring /admin/api/engines.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

8abd66c7

admin: show per-engine loaded models (tasks hover + cross-engine model page) · 5fdbfc54

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

- Tasks page: the engine box "N models" count now has a hover tooltip
  listing every loaded model key on that engine (dotted underline + help
  cursor); "no models loaded" when empty.
- Models page: models loaded on a non-primary engine were shown as idle.
  /admin/api/model-loaded-status is served by the primary engine and only
  reported its own pool. Added a front-proxy interceptor that proxies to
  the primary then unions in every other engine's loaded_models, mirroring
  the existing /admin/api/engines and /admin/api/status aggregation.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

5fdbfc54

oci: run script can enable/disable bundled demo tool web UIs · bce43398

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

The image's demo tool UIs (video editor :8420, videogen :7790, township :7788,
parler TTS :8123) are now individually toggleable at container start — no
rebuild. supervisord.conf drives each program's autostart from a CODERAI_TOOL_*
env var; the entrypoint seeds defaults (three UIs on, parler off) so the
%(ENV_...)s expansion always resolves. run_oci.sh (shipped as the dist
coderai-docker runner) gains:

  --no-tools            disable all three demo UIs
  --enable-tool NAME    force-enable one (also turns on parler)
  --disable-tool NAME   disable one; explicit toggles override --no-tools

NAME ∈ {video-editor, videogen, township, parler}. README + --help updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

bce43398

download: auto-retry Xet-only large blobs with hf_xet instead of failing · 6517dfbe

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

coderai disables hf_xet by default (it bypasses the tqdm progress hook and can
segfault the worker), but some HF blobs are Xet-only and the plain HTTPS path
refuses them with "file too large … install hf_xet" — even though hf_xet is
bundled. The first pass now holds that error instead of surfacing it, detects
the Xet-required message, and transparently retries with Xet enabled
(force_xet → HF_HUB_DISABLE_XET=0). Non-Xet errors are surfaced as before; the
existing crash→disable-Xet retry is unchanged. _attempt now returns the held
error message as a 4th tuple element.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

6517dfbe

ds4: auto-downloaded weights land in coderai GGUF cache + show on models page · ef106ba1

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

When ds4.auto_download is enabled and a deepseek4 request resolves no local
GGUF, the downloaded weight variant is now relocated into coderai's GGUF cache
(get_model_cache_dir; move on same FS, symlink across devices) and registered
in models.json as a text_models entry that mimics the requested ("failed")
model's config — backend auto, on-request, enabled and visible (removed from
unloaded/to_download). model_name is threaded ds4 backend → ensure_service →
ensure_model so the registration mirrors the right entry.

Also: settings "Extra ds4-server args" hint/placeholder updated to reflect the
auto --kv-disk-dir and SSD-streaming expert-cache sizing
(--ssd-streaming-cache-experts), noting Q2_K can fail ds4's CUDA prefill.

Diagnosis (no code change): ds4-server's "cuda prefill failed" on the 93GB
Q2_K variant is a quant-specific ds4 CUDA bug — the 154GB Q4_K completes
prefill fine (verified: "prompt done 434s" vs Q2_K instant failure), with
15.8GB VRAM free either way (not OOM, not cache budget, not coderai).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ef106ba1

fix(ds4): give ds4 models exclusive VRAM (evict others) to stop expert-cache starvation · 00e21ea5

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

ds4-server streams MoE experts and wants the whole GPU for its expert cache, but
coderai's modest VRAM estimate for a ds4 model let it co-reside another model —
starving the cache so ds4's layer-0 FFN expert encode failed ("gpu layer 0 ffn
batch encode failed"). When loading a ds4 model on demand, unload all other models
first so ds4-server gets the full GPU.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

00e21ea5

feat(auto-compact): guarantee last message, chunked summarize, signal-if-too-big · 8bfd0855

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

- Always keep the CURRENT request (last message) intact and as the very last
  message after compaction (the compacted history/summary precedes it).
- summarize strategy now CHUNKs the older history and summarizes map-reduce
  (per-chunk then a combined pass) so the summarization prompt can't itself
  overflow.
- If compaction still can't fit the window (e.g. a single huge final message),
  return HTTP 400 "request too big for context" instead of failing mid-generation.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

8bfd0855

feat: per-model auto-compact of the conversation context (off by default) · a019905f

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

When enabled for a model, if the prompt would exceed auto_compact_pct% of the
model's context window, the conversation is shrunk to ~65% before generation
instead of erroring on overflow. Per-model config (auto_compact / auto_compact_pct
/ auto_compact_strategy) with three strategies:
  - drop_oldest    : keep system messages + the most recent turns that fit.
  - keep_head_tail : also keep the first user turn as an anchor + a count note.
  - summarize      : replace the dropped middle with a best-effort LLM summary
                     (generated by the loaded model; falls back to a count note).

Token size is a cheap chars/4 estimate; membership uses object identity so
value-equal turns don't collide. Wired into the chat path (codai/api/text.py),
the model-configure whitelist, and the model config modal UI.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

a019905f

fix(ds4+config): resolve bare model ids, don't over-estimate VRAM, robust config · 8c85e16a

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

- ds4: resolve a bare/aliased model id (e.g. "Foo-ds4-Q2_K", no path/extension) to
  its configured .gguf via a config/cache-aware resolver — fixes the 503 ("no local
  deepseek4 GGUF resolved") on chat requests (only "Load now" with a full path
  worked before). Ds4Backend reuses the same resolver.
- ds4: report a modest VRAM footprint for ds4 models (measured or ~12GB) instead of
  the 100GB+ GGUF size — ds4-server streams experts from SSD and manages its own
  memory, so the old estimate forced needless ~128GB eviction churn every request.
- ds4: route on-disk KV checkpoints into coderai's offload directory by default
  (--kv-disk-dir <offload>/ds4-kv) unless overridden in extra_args.
- config: tolerant load (_dc drops unknown keys) so a stale/newer config.json never
  crashes the whole load and silently resets ALL settings to defaults (the "had to
  reconfigure everything" bug). save_config + GET/POST settings carry the new ds4
  fields (model_path, auto_download, ssd_streaming).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

8c85e16a

feat(ds4): auto-route deepseek4 GGUFs by architecture; serve the requested file · 6a153c58

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

- Route to ds4 by GGUF ARCHITECTURE (general.architecture == "deepseek4"), read
  from the file header (cached) — not by filename. Mainline deepseek/2/3/32 GGUFs
  stay on llama.cpp; the model_id alias still routes for the download case.
- ds4-server now serves the REQUESTED GGUF: Ds4Backend resolves the model to a
  local .gguf and launches `ds4-server -m <file>` (resolve_service_key keys the
  managed service per file). No fixed-variant assumption.
- Honour the model's per-entry n_ctx for ds4-server --ctx (over the global ctx).
- New config.ds4 options + settings UI: ssd_streaming (--ssd-streaming, stream
  MoE experts from SSD/disk), model_path (explicit -m override), and
  auto_download (OFF by default — only serve GGUFs already present; error clearly
  instead of silently pulling tens of GB; opt in to fetch model_variant).
- AI.PROMPT: document DeepSeek-V4 = pending upstream llama.cpp PRs (needs new ggml
  ops) → ds4 for now; and ds4 routing/offload/text-only specifics.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

6a153c58

fix: tool-call streaming/format robustness + clear over-context error · 3834ecf5

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

- Streaming tool gate now withholds the gemma/qwen native `<|tool_call>` marker
  (and partials) too, not just `<tool_call>`/`call:NAME{` — so the raw marker no
  longer leaks to the client mid-stream (Kilo was executing partial calls).
- Normalize tool-call function.arguments from JSON string → dict before applying
  the chat template, so templates that render `arguments|items` (Qwen) don't
  raise "Can only get item pairs from a mapping".
- Context-window overflow now returns a meaningful error: a structured SSE error
  event (code context_length_exceeded) when streaming, or HTTP 400 with a clear
  message for non-streaming — instead of injecting "[Generation error: …]" as
  assistant content (which polluted chat history).
- Models page: unconfigured GGUF files now expose the "Free disk" button (records
  them as "to download" before deleting), matching HF models.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

3834ecf5

fix: GGUF vision/mmproj routing + VRAM estimate; Tasks page it/s + history · ade800f9

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

- api_model_load: load a GGUF/text model via llama.cpp even when it's also
  bucketed under image/vision (respect the entry's primary model_type), so a
  gemma+mmproj LLM never hits the diffusers from_pretrained() path.
- model config save: a GGUF LLM with an mmproj auto-gets the image_to_text
  capability and is kept out of the diffusers vision_models/image_models buckets.
- VRAM estimate: _runtime_reserve_gb scales the KV-cache reserve by the cache
  quantization (q4_0 ≈ 0.27× f16) so quantized-KV models at large context aren't
  over-estimated into needless CPU offload.
- Free disk (HF): quiet huggingface_hub's noisy not-found traceback and make the
  delete idempotent (repo already gone = success).
- Tasks page: generation tasks now report it/s (or s/it when slow); text keeps
  tok/s. Throughput computed centrally in the task registry (live EMA + run
  average on finish). New "Recent tasks (last 10)" history section.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ade800f9

packaging: build a self-contained distribution bundle by default · 7d3d8e5b

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

- make_dist_bundle.sh: assemble dist/coderai-docker-dist.tar containing the image
  tarball (docker save | pigz), the coderai-docker runner (run_oci.sh, image tag
  pinned), install.sh and README. Stages under dist/ (not tiny /tmp) and hardlinks
  the multi-GB image tarball instead of copying it.
- dist-bundle/install.sh: docker-load the image (sudo-fallback for daemon access)
  then install the coderai-docker runner to /usr/local/bin (root) or
  ~/.local/usr/bin (user, added to ~/.bashrc PATH if missing).
- build_oci_image.sh: after a successful build, export + bundle for distribution
  by default (--no-dist to skip).
- run_oci.sh: default image tag -> coderai:dist (matches what's shipped/loaded).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

7d3d8e5b

Merge feat/township-match-upload: to-download list, mmproj vision, styled... · 766fef3c
Stefy Lanza (nextime / spora ) authored Jun 19, 2026
```
Merge feat/township-match-upload: to-download list, mmproj vision, styled modals, broker + packaging
```
766fef3c

feat: model "to-download" list, mmproj vision, styled modals, broker + packaging · cbf7f147

Stefy Lanza (nextime / spora ) authored Jun 19, 2026

Web UI / models:
- "To download" wishlist: models known but not on disk and not configured show
  as non-configured to-download rows. Free-disk on an unconfigured model, Remove
  on a model with no files left, and a new "Add to list" button in the download
  window all record into models.json `to_download`; pruned on enable/download.
  New endpoints model-mark-download / model-unmark-download.
- mmproj multimodal components: mmproj GGUFs are classified as components (not
  models), selectable per-GGUF in the model config (auto-selected, enables vision
  capability). VulkanBackend loads them via llama.cpp's MTMDChatHandler (--mmproj
  equivalent), and the chat path now forwards image_url content end-to-end.
- All window.alert() replaced by a shared styled showAlert()/showConfirm() modal
  in base.html (used across every admin template).

Front proxy / broker:
- Fix engine model-assignment NameError (keep -> _keep).
- Brokered GET /coderai/capabilities now answers from the front (whole node) so
  multi-GPU hosts report every card, not a single engine's CUDA-visible one.
- Log a clear reason when the broker is disabled.

Packaging (distributable OCI image):
- Multi-stage venv image + smoke test; bundle ds4/wav2lip/sadtalker + parler;
  whisper-server etc. dereferenced (cp -aL) so no dangling symlinks.
- Dockerfile.update + update_oci_image.sh: ~30s incremental code-only rebuild on
  an immutable coderai:base (no 20GB bundle recopy).
- run_oci.sh: --local/--config-dir + --map to run against existing local config
  and data dirs without a rebuild; --debug[=flags] + --log-file for selectable
  debug flags and a host-tailable file log (launcher tees; supervisord kills the
  process group). tmp_janitor age-prunes the dedicated temp dir.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cbf7f147

18 Jun, 2026 11 commits

Almost all ready · 9d023ec2
Stefy Lanza (nextime / spora ) authored Jun 18, 2026

9d023ec2

quant: reject checkpoints whose weights weren't actually quantized · c741ff5b

Stefy Lanza (nextime / spora ) authored Jun 18, 2026

GPTQModel silently leaves layers it can't map (e.g. gemma-4's fused batched MoE
experts) in bf16, producing a near-full-size "checkpoint" that the loader would
redirect to and then offload. The worker now scans the saved safetensors and, if
<50% of large weight bytes are int-packed, deletes the output and marks the job
failed (so it falls back to bitsandbytes) instead of reporting "done".
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

c741ff5b

quant: surface jobs on Tasks page + model list, persist across restart · 6d053dc1

Stefy Lanza (nextime / spora ) authored Jun 18, 2026

- quant jobs now appear on the Tasks page (api_tasks emits kind=quantize) and as
  a live badge on the HF model-list row (polled; re-renders only on change).
- persist job state to <cache>/quantized/jobs.json; on startup a job left
  "running" is marked "interrupted" only if its owning PID is dead (merge-safe
  save so multiple processes don't clobber each other).
- gitignore the runtime model cache (models/), logs/, and the third-party
  GPTQModel/ source clone (installed into the venv, not part of this repo).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

6d053dc1

front: timestamped logs + AMD GPU marketing-name detection · 48be0d91

Stefy Lanza (nextime / spora ) authored Jun 18, 2026

- Prefix front/uvicorn and re-emitted engine log lines with [HH:MM:SS] so the
  front log format matches the engine ([HH:MM:SS][nvidia] …); preserve tqdm
  in-place progress and avoid double-timestamping already-tagged lines.
- gpu_detect: _amd_gpu_name() resolves a card's marketing name via amdgpu
  product_name sysfs, then lspci board/chip name, then vulkaninfo.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

48be0d91

feat: smart context caching, VRAM offload fix, GPTQ/AWQ quant backend · 990f9471

Stefy Lanza (nextime / spora ) authored Jun 18, 2026

Smart context caching (both text backends):
- Per-instance generation lock so pooled concurrent requests can't corrupt a
  shared KV cache (GGUF + HF, incl. streaming worker thread).
- GGUF: enable multi-slot LlamaRAMCache, budget via kv_cache_budget_mb (512MB).
- HF: replace single exact-text KV slot with an LRU of token-prefix slots +
  token-level longest-common-prefix + DynamicCache clone/crop (handles
  mid-history edits); kv_cache_slots (default 3).
- Session-affinity routing in ModelInstancePool.acquire(session_key); key from
  user/X-Session-Id else a stable prefix hash.
- RAM-pressure ladder drops reclaimable prefix caches before evicting models.

VRAM fix:
- Auto-fit check no longer double-counts the KV/activation reserve when
  expected_vram_gb is already a peak estimate — borderline models (e.g.
  gemma-4-26B-A4B) stay GPU-resident instead of forced into MoE-thrashing
  device_map offload.

GPTQ/AWQ fast-kernel quant backend (HF path):
- New codai/models/quant.py: GPTQModel capability detection, quantized-checkpoint
  cache, on-demand background quantize job (falls back to bnb if unsupported).
- quant_backend config (auto|bnb|gptq|awq); loader auto-uses a quantized
  checkpoint with Marlin/ExLlama when present, else bitsandbytes.
- Admin endpoints + "Quantize to 4-bit" button with live status on the model page.
- requirements-nvidia.txt documents the from-source install + numpy caveat.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

990f9471

tasks: show model downloads on the Tasks page · e4c040e2

Stefy Lanza (nextime / spora ) authored Jun 18, 2026

Surface out-of-process download workers (tracked in _download_status) as
first-class tasks in /admin/api/tasks, alongside generations, training and
queued requests. They render with a percentage progress bar plus a
filename / rate / ETA readout, and can be cancelled from the Tasks page
(routed through a shared _cancel_download_session helper) or removed once
finished/failed.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

e4c040e2

front: reap orphaned download workers on shutdown · a2460385

Stefy Lanza (nextime / spora ) authored Jun 18, 2026

stop_all() now sweeps /proc for any codai.admin.download_worker processes
and SIGKILLs them after the engines are stopped — including legacy ppid=1
orphans left by an earlier instance that this front never spawned. Orphaned
workers keep holding huggingface_hub's per-blob file lock, which makes the
next re-download deadlock at 0%, so Ctrl-C now guarantees they're cleaned up.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

a2460385

downloads: dedup re-downloads + kill orphaned workers; single-line load progress · 28a2eecb

Stefy Lanza (nextime / spora ) authored Jun 18, 2026

Re-downloading a model that was already in progress spawned a second
download_worker. Both contend for huggingface_hub's per-blob file lock —
the first downloads, the second blocks on the lock and reports 0% forever
("Downloading full repository…"). Two causes, both fixed:

- Same-process re-download click: api_download_model now dedups via
  _active_download_session(model_id, file_pattern) and attaches the client
  to the live session instead of spawning a rival worker.
- Restart case: workers were plain Popen children with no parent-death
  signal, so a server/engine restart orphaned them (still holding the lock)
  while the new instance lost its in-memory dedup state. Workers now spawn
  with PR_SET_PDEATHSIG=SIGKILL so they die with the server; the re-download
  then resumes cleanly from the .incomplete blob.

Also render engine "Loading weights" tqdm progress as a single updating
line on a TTY (in-place \r) and throttle to whole-percent changes when
piped, instead of one line per update.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

28a2eecb

front: show model-loading progress on the Tasks page · 3020f828

Stefy Lanza (nextime / spora ) authored Jun 18, 2026

During a GIL-heavy from_pretrained the engine's event loop is blocked, so its
/internal/engine-state poll times out and the engine looked "down" with an empty
task list — the real loading task never reached the front. Parse load progress
from the engine's log stream (which the front already pumps) into Engine.loading
and surface it as a synthetic 'loading' task (with live step/total) in
_merge_engine_tasks, even when the primary engine is the blocked one. Cleared on
"Model loaded successfully" or the next successful poll.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

3020f828

tasks: report live tokens/s for text generation · bc9a8352

Stefy Lanza (nextime / spora ) authored Jun 18, 2026

Add a `rate` field to the Task registry and publish step (tokens so far) +
tokens/s from the text streaming loop every few tokens; the Tasks page shows
"N tok · X.X tok/s" while a generation is running. Flows through the engine→
front task aggregation unchanged (asdict serialization).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

bc9a8352

front/engine split, ds4 + media tooling, gemma-4 native tools; ignore runtime artifacts · b297b25f

Stefy Lanza (nextime / spora ) authored Jun 18, 2026

- frontproxy: torch-free front proxy + per-vendor engine supervisor with auth,
  localhost binding, model routing; Ctrl-C now force-kills engines (own session +
  PDEATHSIG, SIGKILL of engine process groups, watchdog on hung drain)
- gemma-4 tool calling: prompt via native tools= template, parse call:NAME{...}
  into tool_calls, honour generation_config EOS so it stops instead of looping
- ds4 external worker, parler/expressive TTS backends, video editor tooling
- --debug-requests: full client<->API request/response logging + live snapshots
- stop tracking runtime artifacts (video_editor/sessions/, tools/coderai_media/)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

b297b25f

17 Jun, 2026 2 commits

township: inline the upload helpers into the generator · 2fb085f4

Stefy Lanza (nextime / spora ) authored Jun 17, 2026

Fold tools/township_upload.py back into gen_township_fighters.py to match the
project's single-file convention. Odds generation, anti-arbitrage checks, ZIP
packing and the chunked upload now live alongside the other township helpers;
_best_variant reuses the existing _video_variants. Behaviour is unchanged.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2fb085f4

township: pack + price + upload matches to townshipcombatleague.com · c84c6208

Stefy Lanza (nextime / spora ) authored Jun 17, 2026

Add ZIP packing, anti-arbitrage odds generation, and chunked upload of a
rendered match to the Township Combat League server (mbetterd 3-step API).

- New tools/township_upload.py: generate_odds (constraint-aware, retries up
  to 10x, verified with the server's exact sure-bet checks), check_arbitrage,
  build_match_zip (OVER/UNDER/WIN1-2/KO1-2/RET1-2/DRAW, best enhanced variant),
  upload_match (create -> chunked zip -> finalize, proxy-safe, progress_cb),
  and a content signature for upload-state invalidation.
- Run page: server endpoint/token/fixture-id, "upload after render" checkbox,
  and configurable odds ranges; persisted via /save-config + load_config.
- Match page: generate/regenerate odds & ZIP, upload with a progress bar
  (polls /job/<id>), and an "Uploaded" badge that clears when the match is
  re-rendered, enhanced, edited or deleted.
- Auto-upload after a full render when configured; skips (keeps local) any
  match whose odds fail the arbitrage check after 10 tries.

KO/RET odds are coupled to wins by the product cap, so high maxima are not
reachable in a no-arbitrage book; the generator samples wins first then bounds
KO/RET accordingly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

c84c6208