1. 19 Jun, 2026 35 commits
    • Stefy Lanza (nextime / spora )'s avatar
      vulkan: fold system role by template signal, not just architecture · 64eb74b7
      Stefy Lanza (nextime / spora ) authored
      Whether a model rejects the 'system' role is a property of the chat
      template baked into the specific GGUF, not the architecture: the gemma-2
      template and the official gemma template raise "System role not
      supported", while 'heretic' gemma4 quant conversions ship a permissive
      template that accepts system. Detect from the embedded
      tokenizer.chat_template (raise_exception/"system role") and fold only
      when it actually rejects system; fall back to architecture (Gemma) when
      no template is readable. Avoids needlessly folding permissive Gemma
      models while still covering gemma-2-9b and strict non-Gemma templates.
      The runtime "System role not supported" retry remains as a safety net.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      64eb74b7
    • Stefy Lanza (nextime / spora )'s avatar
      vulkan: fold system message into user turn when template rejects it · 39a62745
      Stefy Lanza (nextime / spora ) authored
      Gemma's chat template has no 'system' role; llama.cpp raises "System
      role not supported" and the generation fails (the Kilo client always
      sends a system prompt). On that specific error, retry with the system
      message(s) folded into the first user turn — Gemma's own convention,
      and a no-op for models that accept system. Handles both streaming and
      non-streaming paths and preserves multimodal (list) content.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      39a62745
    • Stefy Lanza (nextime / spora )'s avatar
      vulkan: resolve bare-alias gguf locally before trying HuggingFace · eb138bfa
      Stefy Lanza (nextime / spora ) authored
      VulkanBackend.load_model treated any id that isn't a '.gguf' path / file
      / URL as a HuggingFace repo to download. A configured gguf addressed by
      its automatic alias ('coe-…-q4_k_m', no extension) thus 404'd against
      the Hub instead of loading the local file. Resolve the alias via
      _resolve_local_gguf (configured-entry + cache-dir match) first; only fall
      back to the HF path when no local gguf is found.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      eb138bfa
    • Stefy Lanza (nextime / spora )'s avatar
      front: route gguf bare alias by capability to its real engine, not nvidia · 482e47cf
      Stefy Lanza (nextime / spora ) authored
      pick_engine honours the front's assignment (radeon) only if the engine
      can_serve the request's required capability. But _required_cap derived
      that capability from the bare alias 'coe-…-q4_k_m' — no literal 'gguf' —
      so required_capability returned 'transformers' (CUDA-only). radeon is
      gguf-only, failed can_serve, and the request fell through to the default
      engine (nvidia), even though compute_assignment had correctly placed the
      model on radeon (it sees the full '…-q4_k_m.gguf' path).
      
      Resolve the model's configured path in _load_pins (now indexed by the
      .gguf-stripped stem too) and, when the name heuristic yields
      'transformers' but that path is a .gguf, correct the capability to
      'gguf'. whisper/ds4 precedence is unchanged. Combined with the registry
      stem-matching, a bare-alias request now lands on the owning Vulkan/AMD
      engine.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      482e47cf
    • Stefy Lanza (nextime / spora )'s avatar
      vulkan: read free VRAM from amdgpu sysfs (CUDA query is NVIDIA-only) · f204f399
      Stefy Lanza (nextime / spora ) authored
      _free_vram_gb() used torch.cuda.mem_get_info, which returns 0 on an
      AMD/Vulkan engine (no CUDA). That made the auto-offload sizing guard
      (_free > 0) silently false, so n_gpu_layers stayed at -1 (all) and a
      model larger than VRAM was forced entirely onto the GPU — OOM, "Failed
      to load model from file" (e.g. a 13 GB gemma4 model on an 8 GB RX 580).
      A 24 GB CUDA card has room for all layers, so the bug was invisible
      there. Fall back to amdgpu sysfs (mem_info_vram_total - vram_used,
      indexed by Vulkan device order) so AMD GPUs size partial offload too.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      f204f399
    • Stefy Lanza (nextime / spora )'s avatar
      manager: classify gguf by resolving alias to a local .gguf file · 269824b2
      Stefy Lanza (nextime / spora ) authored
      load_model() decided gguf-vs-HF purely from the literal string 'gguf' in
      the model name. A gguf whose alias carries only the quant suffix (e.g.
      'coe-gemma4-coding-hc-14b-a4b-q4_k_m', no literal 'gguf') was mis-routed
      to the HF/transformers backend, which then failed with "is not a valid
      model identifier" (503). Fall back to _resolve_local_gguf(): if the alias
      maps to an actual local .gguf, treat it as gguf and route to llama.cpp.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      269824b2
    • Stefy Lanza (nextime / spora )'s avatar
      multi-engine: route gguf automatic alias (filename without .gguf) · 2eda7574
      Stefy Lanza (nextime / spora ) authored
      A gguf model's assigned/loaded key is its file path, but /v1/models
      advertises it — and clients address it — by the filename without the
      .gguf suffix (the automatic alias). engine_for_assigned /
      engine_for_model / _key_matches_path compared short names verbatim, so
      the automatic alias never matched the .gguf key and routing fell through
      (404 / wrong engine). Normalize both sides via _short_stem so the
      automatic alias resolves to the owning engine with no manual alias.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      2eda7574
    • Stefy Lanza (nextime / spora )'s avatar
      multi-engine: live /v1/models on config change + accept gguf-stem ids · 79c2e44d
      Stefy Lanza (nextime / spora ) authored
      Two bugs made a freshly-configured model unusable until a full restart on a
      multi-engine node:
      
      1. Name mismatch: list_models advertises a gguf's filename WITHOUT .gguf as an
         id, but get_all_allowed_identifiers only allowed the name WITH .gguf, so a
         request using the id from /v1/models was 404'd as "not an allowed model".
         Now the .gguf-stripped stem is allowed too.
      
      2. Stale per-engine assignment: each engine's /v1/models is filtered by the
         assignment set fixed at startup, and secondary engines never re-read
         models.json — so an added/removed model didn't show up or route until
         restart. The front now watches models.json mtime, recomputes the
         assignment, updates its router, and pushes it to every engine via a new
         internal POST /internal/reload-config (re-reads models.json +
         set_assigned_models). /v1/models and routing now reflect add/remove live.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      79c2e44d
    • Stefy Lanza (nextime / spora )'s avatar
      admin: strict 1:1 whisper model<->runner linked by config alias == runner id · f0dcf7eb
      Stefy Lanza (nextime / spora ) authored
      The whisper-server runner's model_id is inherited from the gguf MODEL config's
      alias, which links the two. So:
      - Adding a model config creates one runner whose id is the config's alias
        (auto-minted + stamped onto the config when no alias is given).
      - Removing a config (by config_id or by path) tears down the runner whose id
        matches that config's alias — one config removed = one runner removed/killed.
      
      Replaces the interim model_config_id link.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      f0dcf7eb
    • Stefy Lanza (nextime / spora )'s avatar
      whisper: account a running runner as a loaded model for VRAM eviction · 2a214215
      Stefy Lanza (nextime / spora ) authored
      Starting a whisper-server runner loads the gguf onto the GPU, but it was
      invisible to the VRAM-eviction logic — it never evicted others to make room,
      recorded no footprint, and (lacking a cleanup()) couldn't itself be evicted.
      
      - WhisperServerManager.cleanup() -> stop(), so _evict_one/unload_model can
        free its VRAM like any other model.
      - MultiModelManager.start_whisper_server(): estimate the gguf footprint, evict
        other models if free VRAM is short, start the subprocess, and register it in
        models/models_in_vram/_measured_vram_gb (active_in_vram). It's now both a
        trigger for eviction and an eviction candidate.
      - stop_whisper_server(): stop + clear all that accounting (frees VRAM).
      - Routed every start/stop through these: on-request transcription, engine
        startup pre-load, admin model-load (Load button) and model-unload/disable.
      
      So: starting a runner = a model load (evicts as needed); unloading = frees VRAM.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      2a214215
    • Stefy Lanza (nextime / spora )'s avatar
      admin: whisper gguf model auto-manages its runner (1:1) · 3d551444
      Stefy Lanza (nextime / spora ) authored
      Model a whisper gguf as two things: a MODEL config (a .gguf entry with
      backend=whisper-server and NO model_path — enables the model, holds load
      strategy, shown on the GGUF row) and a RUNNER (backend=whisper-server WITH
      model_path — the subprocess, shown in the whisper card).
      
      - Enabling a .gguf with speech_to_text marks it backend=whisper-server and
        auto-creates exactly one runner (1:1) on a free port.
      - Disabling the model removes + kills all its runners (cascade by model_path).
      - Removing a runner (or model) now stops the subprocess + drops registry
        entries, instead of leaving it running until restart.
      - cached-models shows the model config on the GGUF row but excludes runners;
        the whisper card shows only runners (require model_path).
      - engine startup only launches runners (entries with model_path), never the
        bare model config.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      3d551444
    • Stefy Lanza (nextime / spora )'s avatar
      admin: keep GGUF model config and whisper-server runner config separate · 41e3661e
      Stefy Lanza (nextime / spora ) authored
      A GGUF model config (configure + enable the model, load strategy) and a
      whisper-server config (the runner: port, gpu, which model) are two distinct
      things. Showing whisper-servers as the backing file's configs made the GGUF
      row's "Configure" open the whisper form — conflating them.
      
      Whisper-server entries are again excluded from a GGUF file's editable config
      list (they live only in their own card); the GGUF row's Configure opens the
      general model config modal. The file still reflects "loaded" via its
      model_path in the loaded-status sets.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      41e3661e
    • Stefy Lanza (nextime / spora )'s avatar
      admin: show whisper-servers as gguf configs again, edit via whisper editor · 615967d8
      Stefy Lanza (nextime / spora ) authored
      Reverted the over-correction that hid whisper-server entries from the
      backing GGUF file's config list — they should appear there (the file is
      configured through them). To avoid the duplicate-config bug, editing a
      whisper-server config from the GGUF row now routes to the whisper editor
      (which updates in place by id) instead of the general config modal. Pills
      for whisper-server configs are labelled by id so the two instances are
      distinguishable (they share the "whisper" alias).
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      615967d8
    • Stefy Lanza (nextime / spora )'s avatar
      admin: fix duplicate gguf configs from whisper-server pollution · ea8dd92c
      Stefy Lanza (nextime / spora ) authored
      Editing a GGUF model's config kept appending duplicate entries. Root cause:
      api_cached_models added whisper-server entries to the backing GGUF file's
      config list (keyed by model_path). With whisper0/whisper1 both pointing at
      the file, the GGUF row's configs[0] became a whisper-server entry, which
      carries no config_id — so "Configure" on that row treated every save as a
      brand-new config and spawned a fresh duplicate each time.
      
      - cached-models: skip whisper-server entries entirely (they're managed in
        their own card; the file still shows "loaded" via its model_path key).
      - model-configure (whisper-server): update an existing entry in place when
        the id matches instead of 409-or-append, preserving unmanaged fields
        (engine, config_id).
      - model-disable: guard against whisper-server entries' path=None so a
        path-based disable can't crash on basename(None).
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      ea8dd92c
    • Stefy Lanza (nextime / spora )'s avatar
      admin: stop the models page jumping to the bottom on refresh · d7636907
      Stefy Lanza (nextime / spora ) authored
      Clicking Load/Unload (or any refreshLocal) re-ran loadCachedModels, which
      blanked the HF/GGUF lists to "Loading…" every time. That collapsed the page
      height and threw the viewport to the end, so a whisper-server unload looked
      like it "did nothing and scrolled to the bottom" even though it worked.
      
      Now the "Loading…" placeholder only shows on the first (empty) load; on a
      refresh the existing rows stay in place and the scroll position is captured
      and restored around the re-render.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      d7636907
    • Stefy Lanza (nextime / spora )'s avatar
      admin: whisper-server load mode derives from the backing GGUF config · b717f1dc
      Stefy Lanza (nextime / spora ) authored
      Removed the redundant "Load mode" dropdown from both whisper-server forms
      (builder + edit modal). A whisper-server is backed by a GGUF model, so its
      load mode now derives from that GGUF's configured load_mode (default
      on-request) instead of being set separately in two places. Load/offload
      strategy stays solely in the GGUF model config.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      b717f1dc
    • Stefy Lanza (nextime / spora )'s avatar
      admin: whisper-server unload clears registry + gguf file shows loaded · 2b8e69df
      Stefy Lanza (nextime / spora ) authored
      Two whisper-server UI issues on a multi-engine node:
      
      - Unload didn't visibly take effect: a whisper-server is registered both as
        a subprocess (whisper_servers) AND in the generic .models/.model_pools
        registry under audio:<id>. The unload stopped the subprocess but left the
        registry entry, so the row kept showing "Unload". Now it also drops the
        audio:<id>/<id> registry entries (and matches by id, audio:<id>, or the
        gguf model_path, so unloading the file stops every server using it).
      
      - The backing gguf file showed "Load" while its whisper-server was running.
        Surface each running server's _model_path in the loaded-key sets
        (engine-state, model-loaded-status, status) so the GGUF-file row reflects
        that the file is in use.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      2b8e69df
    • Stefy Lanza (nextime / spora )'s avatar
      admin: show models pinned to a secondary engine (whisper form + loaded) · 25cb3b80
      Stefy Lanza (nextime / spora ) authored
      /admin/api/models is served by the primary engine and ran list_models()
      with the per-engine assignment filter, so models pinned to a secondary
      engine (whisper-servers configured with engine=radeon) were dropped. That
      emptied the whisper-server config form and left those models unmatched in
      the loaded-status check, so they never showed as loaded.
      
      Add list_models(all_engines=True) to bypass _entry_assigned, and call it
      from the admin model list. Combined with the whisper-servers now reported
      in /internal/engine-state, the radeon whisper-servers reappear in their own
      form and show as loaded.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      25cb3b80
    • Stefy Lanza (nextime / spora )'s avatar
      front: include running whisper-servers in /internal/engine-state · 801d21a9
      Stefy Lanza (nextime / spora ) authored
      The supervisor polls /internal/engine-state (not /admin/api/status) to
      populate each engine's loaded_models in the registry, which the front then
      aggregates for the models page. That endpoint only read .models, so a
      whisper-server running on a secondary engine still showed as not loaded.
      Fold each running server (id + `audio:` alias) into its loaded_models.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      801d21a9
    • Stefy Lanza (nextime / spora )'s avatar
      admin: actually free VRAM on unload + show whisper-server as loaded · 84def90a
      Stefy Lanza (nextime / spora ) authored
      Two issues when unloading/reporting models on a multi-engine node:
      
      - Unload didn't free VRAM for pooled models. api_model_unload only popped
        multi_model_manager.models and never touched model_pools, so a model
        served with max_instances>1 (which lives only in the pool) kept all its
        instances resident. Now it searches both dicts and calls unload_model(),
        which cleans up the whole pool + runs gc/empty_cache. Also handles
        whisper-server models (their own subprocess) by stopping the server.
      
      - whisper-server showed as "not loaded". It runs as a subprocess tracked
        in whisper_servers, not in .models. Fold each running server (id +
        `audio:` alias) into both the model-loaded-status list and the
        /admin/api/status loaded_keys, so the models page, dashboard count and
        per-engine box all reflect it (incl. on a secondary engine).
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      84def90a
    • Stefy Lanza (nextime / spora )'s avatar
      front: route admin model load/unload to the owning engine · 8abd66c7
      Stefy Lanza (nextime / spora ) authored
      model-load/model-unload were proxied to the primary engine, so unloading
      (or loading) a model that lives on a secondary engine hit the wrong process
      and silently no-op'd (was_loaded=False). Add front-proxy interceptors:
      
      - unload: find the engine whose loaded_models matches the path and forward
        the request there; fall back to the primary.
      - load: reuse an engine already serving the model, else the model's engine
        pin from models.json, else the primary.
      
      Registered before the catch-all proxy, mirroring /admin/api/engines.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      8abd66c7
    • Stefy Lanza (nextime / spora )'s avatar
      admin: show per-engine loaded models (tasks hover + cross-engine model page) · 5fdbfc54
      Stefy Lanza (nextime / spora ) authored
      - Tasks page: the engine box "N models" count now has a hover tooltip
        listing every loaded model key on that engine (dotted underline + help
        cursor); "no models loaded" when empty.
      - Models page: models loaded on a non-primary engine were shown as idle.
        /admin/api/model-loaded-status is served by the primary engine and only
        reported its own pool. Added a front-proxy interceptor that proxies to
        the primary then unions in every other engine's loaded_models, mirroring
        the existing /admin/api/engines and /admin/api/status aggregation.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      5fdbfc54
    • Stefy Lanza (nextime / spora )'s avatar
      oci: run script can enable/disable bundled demo tool web UIs · bce43398
      Stefy Lanza (nextime / spora ) authored
      The image's demo tool UIs (video editor :8420, videogen :7790, township :7788,
      parler TTS :8123) are now individually toggleable at container start — no
      rebuild. supervisord.conf drives each program's autostart from a CODERAI_TOOL_*
      env var; the entrypoint seeds defaults (three UIs on, parler off) so the
      %(ENV_...)s expansion always resolves. run_oci.sh (shipped as the dist
      coderai-docker runner) gains:
      
        --no-tools            disable all three demo UIs
        --enable-tool NAME    force-enable one (also turns on parler)
        --disable-tool NAME   disable one; explicit toggles override --no-tools
      
      NAME ∈ {video-editor, videogen, township, parler}. README + --help updated.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      bce43398
    • Stefy Lanza (nextime / spora )'s avatar
      download: auto-retry Xet-only large blobs with hf_xet instead of failing · 6517dfbe
      Stefy Lanza (nextime / spora ) authored
      coderai disables hf_xet by default (it bypasses the tqdm progress hook and can
      segfault the worker), but some HF blobs are Xet-only and the plain HTTPS path
      refuses them with "file too large … install hf_xet" — even though hf_xet is
      bundled. The first pass now holds that error instead of surfacing it, detects
      the Xet-required message, and transparently retries with Xet enabled
      (force_xet → HF_HUB_DISABLE_XET=0). Non-Xet errors are surfaced as before; the
      existing crash→disable-Xet retry is unchanged. _attempt now returns the held
      error message as a 4th tuple element.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      6517dfbe
    • Stefy Lanza (nextime / spora )'s avatar
      ds4: auto-downloaded weights land in coderai GGUF cache + show on models page · ef106ba1
      Stefy Lanza (nextime / spora ) authored
      When ds4.auto_download is enabled and a deepseek4 request resolves no local
      GGUF, the downloaded weight variant is now relocated into coderai's GGUF cache
      (get_model_cache_dir; move on same FS, symlink across devices) and registered
      in models.json as a text_models entry that mimics the requested ("failed")
      model's config — backend auto, on-request, enabled and visible (removed from
      unloaded/to_download). model_name is threaded ds4 backend → ensure_service →
      ensure_model so the registration mirrors the right entry.
      
      Also: settings "Extra ds4-server args" hint/placeholder updated to reflect the
      auto --kv-disk-dir and SSD-streaming expert-cache sizing
      (--ssd-streaming-cache-experts), noting Q2_K can fail ds4's CUDA prefill.
      
      Diagnosis (no code change): ds4-server's "cuda prefill failed" on the 93GB
      Q2_K variant is a quant-specific ds4 CUDA bug — the 154GB Q4_K completes
      prefill fine (verified: "prompt done 434s" vs Q2_K instant failure), with
      15.8GB VRAM free either way (not OOM, not cache budget, not coderai).
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      ef106ba1
    • Stefy Lanza (nextime / spora )'s avatar
      fix(ds4): give ds4 models exclusive VRAM (evict others) to stop expert-cache starvation · 00e21ea5
      Stefy Lanza (nextime / spora ) authored
      ds4-server streams MoE experts and wants the whole GPU for its expert cache, but
      coderai's modest VRAM estimate for a ds4 model let it co-reside another model —
      starving the cache so ds4's layer-0 FFN expert encode failed ("gpu layer 0 ffn
      batch encode failed"). When loading a ds4 model on demand, unload all other models
      first so ds4-server gets the full GPU.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      00e21ea5
    • Stefy Lanza (nextime / spora )'s avatar
      feat(auto-compact): guarantee last message, chunked summarize, signal-if-too-big · 8bfd0855
      Stefy Lanza (nextime / spora ) authored
      - Always keep the CURRENT request (last message) intact and as the very last
        message after compaction (the compacted history/summary precedes it).
      - summarize strategy now CHUNKs the older history and summarizes map-reduce
        (per-chunk then a combined pass) so the summarization prompt can't itself
        overflow.
      - If compaction still can't fit the window (e.g. a single huge final message),
        return HTTP 400 "request too big for context" instead of failing mid-generation.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      8bfd0855
    • Stefy Lanza (nextime / spora )'s avatar
      feat: per-model auto-compact of the conversation context (off by default) · a019905f
      Stefy Lanza (nextime / spora ) authored
      When enabled for a model, if the prompt would exceed auto_compact_pct% of the
      model's context window, the conversation is shrunk to ~65% before generation
      instead of erroring on overflow. Per-model config (auto_compact / auto_compact_pct
      / auto_compact_strategy) with three strategies:
        - drop_oldest    : keep system messages + the most recent turns that fit.
        - keep_head_tail : also keep the first user turn as an anchor + a count note.
        - summarize      : replace the dropped middle with a best-effort LLM summary
                           (generated by the loaded model; falls back to a count note).
      
      Token size is a cheap chars/4 estimate; membership uses object identity so
      value-equal turns don't collide. Wired into the chat path (codai/api/text.py),
      the model-configure whitelist, and the model config modal UI.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      a019905f
    • Stefy Lanza (nextime / spora )'s avatar
      fix(ds4+config): resolve bare model ids, don't over-estimate VRAM, robust config · 8c85e16a
      Stefy Lanza (nextime / spora ) authored
      - ds4: resolve a bare/aliased model id (e.g. "Foo-ds4-Q2_K", no path/extension) to
        its configured .gguf via a config/cache-aware resolver — fixes the 503 ("no local
        deepseek4 GGUF resolved") on chat requests (only "Load now" with a full path
        worked before). Ds4Backend reuses the same resolver.
      - ds4: report a modest VRAM footprint for ds4 models (measured or ~12GB) instead of
        the 100GB+ GGUF size — ds4-server streams experts from SSD and manages its own
        memory, so the old estimate forced needless ~128GB eviction churn every request.
      - ds4: route on-disk KV checkpoints into coderai's offload directory by default
        (--kv-disk-dir <offload>/ds4-kv) unless overridden in extra_args.
      - config: tolerant load (_dc drops unknown keys) so a stale/newer config.json never
        crashes the whole load and silently resets ALL settings to defaults (the "had to
        reconfigure everything" bug). save_config + GET/POST settings carry the new ds4
        fields (model_path, auto_download, ssd_streaming).
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      8c85e16a
    • Stefy Lanza (nextime / spora )'s avatar
      feat(ds4): auto-route deepseek4 GGUFs by architecture; serve the requested file · 6a153c58
      Stefy Lanza (nextime / spora ) authored
      - Route to ds4 by GGUF ARCHITECTURE (general.architecture == "deepseek4"), read
        from the file header (cached) — not by filename. Mainline deepseek/2/3/32 GGUFs
        stay on llama.cpp; the model_id alias still routes for the download case.
      - ds4-server now serves the REQUESTED GGUF: Ds4Backend resolves the model to a
        local .gguf and launches `ds4-server -m <file>` (resolve_service_key keys the
        managed service per file). No fixed-variant assumption.
      - Honour the model's per-entry n_ctx for ds4-server --ctx (over the global ctx).
      - New config.ds4 options + settings UI: ssd_streaming (--ssd-streaming, stream
        MoE experts from SSD/disk), model_path (explicit -m override), and
        auto_download (OFF by default — only serve GGUFs already present; error clearly
        instead of silently pulling tens of GB; opt in to fetch model_variant).
      - AI.PROMPT: document DeepSeek-V4 = pending upstream llama.cpp PRs (needs new ggml
        ops) → ds4 for now; and ds4 routing/offload/text-only specifics.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      6a153c58
    • Stefy Lanza (nextime / spora )'s avatar
      fix: tool-call streaming/format robustness + clear over-context error · 3834ecf5
      Stefy Lanza (nextime / spora ) authored
      - Streaming tool gate now withholds the gemma/qwen native `<|tool_call>` marker
        (and partials) too, not just `<tool_call>`/`call:NAME{` — so the raw marker no
        longer leaks to the client mid-stream (Kilo was executing partial calls).
      - Normalize tool-call function.arguments from JSON string → dict before applying
        the chat template, so templates that render `arguments|items` (Qwen) don't
        raise "Can only get item pairs from a mapping".
      - Context-window overflow now returns a meaningful error: a structured SSE error
        event (code context_length_exceeded) when streaming, or HTTP 400 with a clear
        message for non-streaming — instead of injecting "[Generation error: …]" as
        assistant content (which polluted chat history).
      - Models page: unconfigured GGUF files now expose the "Free disk" button (records
        them as "to download" before deleting), matching HF models.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      3834ecf5
    • Stefy Lanza (nextime / spora )'s avatar
      fix: GGUF vision/mmproj routing + VRAM estimate; Tasks page it/s + history · ade800f9
      Stefy Lanza (nextime / spora ) authored
      - api_model_load: load a GGUF/text model via llama.cpp even when it's also
        bucketed under image/vision (respect the entry's primary model_type), so a
        gemma+mmproj LLM never hits the diffusers from_pretrained() path.
      - model config save: a GGUF LLM with an mmproj auto-gets the image_to_text
        capability and is kept out of the diffusers vision_models/image_models buckets.
      - VRAM estimate: _runtime_reserve_gb scales the KV-cache reserve by the cache
        quantization (q4_0 ≈ 0.27× f16) so quantized-KV models at large context aren't
        over-estimated into needless CPU offload.
      - Free disk (HF): quiet huggingface_hub's noisy not-found traceback and make the
        delete idempotent (repo already gone = success).
      - Tasks page: generation tasks now report it/s (or s/it when slow); text keeps
        tok/s. Throughput computed centrally in the task registry (live EMA + run
        average on finish). New "Recent tasks (last 10)" history section.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      ade800f9
    • Stefy Lanza (nextime / spora )'s avatar
      packaging: build a self-contained distribution bundle by default · 7d3d8e5b
      Stefy Lanza (nextime / spora ) authored
      - make_dist_bundle.sh: assemble dist/coderai-docker-dist.tar containing the image
        tarball (docker save | pigz), the coderai-docker runner (run_oci.sh, image tag
        pinned), install.sh and README. Stages under dist/ (not tiny /tmp) and hardlinks
        the multi-GB image tarball instead of copying it.
      - dist-bundle/install.sh: docker-load the image (sudo-fallback for daemon access)
        then install the coderai-docker runner to /usr/local/bin (root) or
        ~/.local/usr/bin (user, added to ~/.bashrc PATH if missing).
      - build_oci_image.sh: after a successful build, export + bundle for distribution
        by default (--no-dist to skip).
      - run_oci.sh: default image tag -> coderai:dist (matches what's shipped/loaded).
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      7d3d8e5b
    • Stefy Lanza (nextime / spora )'s avatar
      Merge feat/township-match-upload: to-download list, mmproj vision, styled... · 766fef3c
      Stefy Lanza (nextime / spora ) authored
      Merge feat/township-match-upload: to-download list, mmproj vision, styled modals, broker + packaging
      766fef3c
    • Stefy Lanza (nextime / spora )'s avatar
      feat: model "to-download" list, mmproj vision, styled modals, broker + packaging · cbf7f147
      Stefy Lanza (nextime / spora ) authored
      Web UI / models:
      - "To download" wishlist: models known but not on disk and not configured show
        as non-configured to-download rows. Free-disk on an unconfigured model, Remove
        on a model with no files left, and a new "Add to list" button in the download
        window all record into models.json `to_download`; pruned on enable/download.
        New endpoints model-mark-download / model-unmark-download.
      - mmproj multimodal components: mmproj GGUFs are classified as components (not
        models), selectable per-GGUF in the model config (auto-selected, enables vision
        capability). VulkanBackend loads them via llama.cpp's MTMDChatHandler (--mmproj
        equivalent), and the chat path now forwards image_url content end-to-end.
      - All window.alert() replaced by a shared styled showAlert()/showConfirm() modal
        in base.html (used across every admin template).
      
      Front proxy / broker:
      - Fix engine model-assignment NameError (keep -> _keep).
      - Brokered GET /coderai/capabilities now answers from the front (whole node) so
        multi-GPU hosts report every card, not a single engine's CUDA-visible one.
      - Log a clear reason when the broker is disabled.
      
      Packaging (distributable OCI image):
      - Multi-stage venv image + smoke test; bundle ds4/wav2lip/sadtalker + parler;
        whisper-server etc. dereferenced (cp -aL) so no dangling symlinks.
      - Dockerfile.update + update_oci_image.sh: ~30s incremental code-only rebuild on
        an immutable coderai:base (no 20GB bundle recopy).
      - run_oci.sh: --local/--config-dir + --map to run against existing local config
        and data dirs without a rebuild; --debug[=flags] + --log-file for selectable
        debug flags and a host-tailable file log (launcher tees; supervisord kills the
        process group). tmp_janitor age-prunes the dedicated temp dir.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      cbf7f147
  2. 18 Jun, 2026 5 commits
    • Stefy Lanza (nextime / spora )'s avatar
      Almost all ready · 9d023ec2
      Stefy Lanza (nextime / spora ) authored
      9d023ec2
    • Stefy Lanza (nextime / spora )'s avatar
      quant: reject checkpoints whose weights weren't actually quantized · c741ff5b
      Stefy Lanza (nextime / spora ) authored
      GPTQModel silently leaves layers it can't map (e.g. gemma-4's fused batched MoE
      experts) in bf16, producing a near-full-size "checkpoint" that the loader would
      redirect to and then offload. The worker now scans the saved safetensors and, if
      <50% of large weight bytes are int-packed, deletes the output and marks the job
      failed (so it falls back to bitsandbytes) instead of reporting "done".
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      c741ff5b
    • Stefy Lanza (nextime / spora )'s avatar
      quant: surface jobs on Tasks page + model list, persist across restart · 6d053dc1
      Stefy Lanza (nextime / spora ) authored
      - quant jobs now appear on the Tasks page (api_tasks emits kind=quantize) and as
        a live badge on the HF model-list row (polled; re-renders only on change).
      - persist job state to <cache>/quantized/jobs.json; on startup a job left
        "running" is marked "interrupted" only if its owning PID is dead (merge-safe
        save so multiple processes don't clobber each other).
      - gitignore the runtime model cache (models/), logs/, and the third-party
        GPTQModel/ source clone (installed into the venv, not part of this repo).
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      6d053dc1
    • Stefy Lanza (nextime / spora )'s avatar
      front: timestamped logs + AMD GPU marketing-name detection · 48be0d91
      Stefy Lanza (nextime / spora ) authored
      - Prefix front/uvicorn and re-emitted engine log lines with [HH:MM:SS] so the
        front log format matches the engine ([HH:MM:SS][nvidia] …); preserve tqdm
        in-place progress and avoid double-timestamping already-tagged lines.
      - gpu_detect: _amd_gpu_name() resolves a card's marketing name via amdgpu
        product_name sysfs, then lspci board/chip name, then vulkaninfo.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      48be0d91
    • Stefy Lanza (nextime / spora )'s avatar
      feat: smart context caching, VRAM offload fix, GPTQ/AWQ quant backend · 990f9471
      Stefy Lanza (nextime / spora ) authored
      Smart context caching (both text backends):
      - Per-instance generation lock so pooled concurrent requests can't corrupt a
        shared KV cache (GGUF + HF, incl. streaming worker thread).
      - GGUF: enable multi-slot LlamaRAMCache, budget via kv_cache_budget_mb (512MB).
      - HF: replace single exact-text KV slot with an LRU of token-prefix slots +
        token-level longest-common-prefix + DynamicCache clone/crop (handles
        mid-history edits); kv_cache_slots (default 3).
      - Session-affinity routing in ModelInstancePool.acquire(session_key); key from
        user/X-Session-Id else a stable prefix hash.
      - RAM-pressure ladder drops reclaimable prefix caches before evicting models.
      
      VRAM fix:
      - Auto-fit check no longer double-counts the KV/activation reserve when
        expected_vram_gb is already a peak estimate — borderline models (e.g.
        gemma-4-26B-A4B) stay GPU-resident instead of forced into MoE-thrashing
        device_map offload.
      
      GPTQ/AWQ fast-kernel quant backend (HF path):
      - New codai/models/quant.py: GPTQModel capability detection, quantized-checkpoint
        cache, on-demand background quantize job (falls back to bnb if unsupported).
      - quant_backend config (auto|bnb|gptq|awq); loader auto-uses a quantized
        checkpoint with Marlin/ExLlama when present, else bitsandbytes.
      - Admin endpoints + "Quantize to 4-bit" button with live status on the model page.
      - requirements-nvidia.txt documents the from-source install + numpy caveat.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      990f9471