1. 20 Jun, 2026 19 commits
    • Stefy Lanza (nextime / spora )'s avatar
      backend: per-model kv_offload flag to keep the KV cache in host RAM · 4754beff
      Stefy Lanza (nextime / spora ) authored
      Large contexts make the KV cache huge (a 256k q4_0 cache is several GB), which
      won't fit in VRAM alongside the weights. llama.cpp can't page KV to disk, but it
      can keep it in system RAM via --no-kv-offload. Expose that as a per-model
      kv_offload flag (default unchanged = KV in VRAM): set kv_offload=false to pass
      offload_kqv=False to llama.cpp, freeing VRAM for big contexts at the cost of
      slower decode (KV ops cross PCIe). Also allow the key in the admin model-config
      endpoint so it's persistable from the UI.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      4754beff
    • Stefy Lanza (nextime / spora )'s avatar
      version: mark CoderAI 0.1.0 and surface it in the admin web UI · da4359c3
      Stefy Lanza (nextime / spora ) authored
      Add a canonical codai.__version__ = "0.1.0" as the single source of truth (kept
      separate from Config.version, which is the config-schema/migration version). The
      admin template renderer injects it as coderai_version, and base.html shows it as a
      small "v0.1.0" pill next to the CoderAI logo on every page.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      da4359c3
    • Stefy Lanza (nextime / spora )'s avatar
      text: stop runaway tool-call loops + honor client repetition penalties · a535c27f
      Stefy Lanza (nextime / spora ) authored
      Some quantized fine-tunes (seen with an "Aggressive" Qwen3.6-35B Q4_K_M) collapse
      into a runaway repetition loop — emitting a malformed parallel tool-call flood of
      1700+ tokens that never terminates — when top_p=1.0 and no repetition penalty are
      in effect (exactly the conditions Qwen's own docs warn cause endless repetitions).
      
      Two fixes:
      
      1. Anti-loop generation stop in stream_chat_response: a model-agnostic detector
         normalises away the variable parts of the tail (quoted strings, filesystem
         paths, whitespace) so a loop whose only per-cycle difference is an arg/path
         still reads as periodic, then breaks generation when a short structural unit
         repeats >=5x back-to-back. Tuned to not trip on prose, repetitive code, or a
         legit handful of distinct tool calls.
      
      2. Honor client-supplied repetition controls. The chat paths previously forwarded
         only temperature/top_p, silently dropping repeat/presence/frequency penalty —
         so a caller (e.g. Kilo) setting them per-model had no effect. Plumb them through
         generate_chat_stream / generate_chat to both backends (cuda already accepts
         them; vulkan now does too) with graceful signature fallbacks. Defaults are
         no-ops, so unset clients are unaffected.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      a535c27f
    • Stefy Lanza (nextime / spora )'s avatar
      text: make auto-compaction actually fire — fix config lookup + max_tokens-aware layered trimming · 913e283a
      Stefy Lanza (nextime / spora ) authored
      Auto-compaction never triggered: multi_model_manager.config stores the
      whitelisted build_runtime_kwargs() dict, which drops the per-model
      auto_compact* keys (they survive only under _raw_cfg), so _resolve_compaction
      always read the global default (False) and returned None. Read the keys via a
      _raw_cfg fallback so per-model compaction config is honoured.
      
      Also rework the over-context handling to count the reply reservation, since the
      reply is generated into the same window (prompt + max_tokens <= n_ctx). Four
      layers, cheapest first:
        1. fits as-is              -> nothing
        2. overflow within tol     -> trim max_tokens to fit (lossless)
        3. beyond tol & big prompt -> compact history (drop/summarize)
        4. single message too big  -> slice it (summarize its middle, keep head/tail)
      
      The chars/4 estimate undercounts token-dense code/JSON, so trimming to the exact
      n_ctx edge could still overflow; inflate the estimate by a configurable
      estimate_safety (default 1.15) for all physical-fit decisions.
      
      New CompactionConfig knobs (per-model overridable): tolerance_pct (20),
      min_output (512), estimate_safety (1.15). Effective max_tokens is threaded back
      to both the streaming and non-streaming generation paths.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      913e283a
    • Stefy Lanza (nextime / spora )'s avatar
      front: drain in-flight requests before bouncing an engine · 34d666d6
      Stefy Lanza (nextime / spora ) authored
      An engine restart (admin button / config change) previously SIGTERM'd the
      process immediately, severing any active SSE stream mid-response — the client
      saw httpcore.RemoteProtocolError "peer closed connection without sending
      complete message body".
      
      Now restart_engine marks the engine `draining` first: the router stops routing
      NEW requests to it (Engine.is_alive() reports false while draining, and the poll
      loop can't flip it back healthy), and the supervisor waits up to
      server.engine_restart_drain_grace seconds (default 30, 0 = immediate) for the
      in-flight count to reach zero before killing the process. Stragglers past the
      grace window are still bounced.
      
      In-flight is tracked per engine in the front proxy: proxy() increments on send
      and decrements once the streamed response is fully drained (or the send failed).
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      34d666d6
    • Stefy Lanza (nextime / spora )'s avatar
      text: surface model reasoning as a separate field (think/thinking/thought) · 0a7d343a
      Stefy Lanza (nextime / spora ) authored
      Qwen-style chat templates pre-fill the opening <think> in the prompt, so the
      model emits only the reasoning body + a bare closing </think> — and they think
      by DEFAULT regardless of the API enable_thinking flag. The old paired-tag
      reasoning extractor missed the bare close, leaking the whole thought (and the
      </think>) into content and conversation history.
      
      - extract_reasoning_content: handle a bare </think|/thinking|/thought> with no
        opening tag (treat the prefix as reasoning).
      - streaming: a chunk-safe reasoning gate routes the thought into
        delta.reasoning / reasoning_content until </think>, then flips to content;
        tool extraction runs on the post-</think> answer only.
      - non-streaming: extract reasoning, set message.reasoning(+_content), clean
        content; tools parsed from the answer.
      - activate whenever the model auto-thinks (qwen3/qwq/deepseek-r1/… name) OR
        reasoning is explicitly enabled — not just on the API flag.
      - configurable suppression: per-model `suppress_reasoning`, or per-request via
        the standard reasoning:{exclude:true} / reasoning_effort:"none" /
        suppress_reasoning fields. Emits both `reasoning` and DeepSeek-style
        `reasoning_content` for client compatibility.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      0a7d343a
    • Stefy Lanza (nextime / spora )'s avatar
      packaging: add lm-sensors to the OCI runtime images · 54a83db0
      Stefy Lanza (nextime / spora ) authored
      Optional CPU-temperature source for thermal control. Not essential — the
      thermal monitor reads CPU temp from psutil and the kernel's /sys/class/thermal
      zones first (both work in-container); `sensors` is only a last-resort text
      parse for hosts whose sysfs doesn't expose a CPU zone. Added to both the
      from-venv and from-scratch runtime stages for completeness.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      54a83db0
    • Stefy Lanza (nextime / spora )'s avatar
      parser: don't amplify degenerate <tool> spam from too-low quants · 0b7262ae
      Stefy Lanza (nextime / spora ) authored
      The plaintext <tool> rescue could turn a failing 2-bit model's repetition
      loop (<tool>glob</tool><tool>glob</tool>… / bare names, no args) into a flood
      of bogus tool calls. Harden it: reject a batch with >6 <tool> blocks (that's
      model degeneration, not many real calls) and drop any bare <tool>name</tool>
      that carries no key: value argument (the spam signature). Genuine single/few
      calls with arguments still parse; combined with the existing trailing-action,
      declared-name, and DeepSeek-only scoping.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      0b7262ae
    • Stefy Lanza (nextime / spora )'s avatar
      ds4: cache-cleanup safety, in-flight gate, default-model downloader, low-quant tool parse · e23dd2a7
      Stefy Lanza (nextime / spora ) authored
      - ds4 kv janitor: a checkpoint is deleted only when ALL hold — untouched by
        max(mtime, atime) for the age (so a checkpoint ds4 merely READS, which bumps
        atime not mtime, is spared); not currently open (fd/mmap) by a ds4-server;
        and ds4 is not serving any request. New in-flight counter on Ds4Backend
        (any_request_active) gates the sweep.
      - settings: "Download a default DeepSeek V4 model" — select + button backed by
        new /admin/api/ds4/default-models catalog (q2-imatrix / q2-q4 / q4 / mtp from
        antirez/deepseek-v4-gguf). Reuses the normal downloader, which flattens the
        gguf into the cache and surfaces it in the model list; live progress.
      - parser: rescue the degraded plaintext <tool>name arg: value</tool> form that
        heavy quants (ds4 q2-imatrix) emit when they can't reproduce DSML. Scoped to
        DeepSeekParser only (never the shared ToolCallParser, so other families are
        untouched), requires a DECLARED tool name, plaintext-only inner, and the
        block(s) to be the message's trailing action — so a <tool> example inside a
        prose reply is not misread as a call.
      - settings: corrected ds4 perf note (i-quants/Q2_K fail CUDA prefill; use Q4_K+).
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      e23dd2a7
    • Stefy Lanza (nextime / spora )'s avatar
      fix: ds4 kv-janitor loop var shadowed `import threading as _t` · 681b19e0
      Stefy Lanza (nextime / spora ) authored
      The --kv-disk-dir parse loop used `for _i, _t in ...`, making `_t` a local
      in main() and shadowing the module-level `import threading as _t`, so the
      later `_t.Thread(...)` raised UnboundLocalError and crash-looped both
      engines. Renamed the loop var to `_tok`.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      681b19e0
    • Stefy Lanza (nextime / spora )'s avatar
      ds4: on-disk KV-cache age cleanup + HF search on Enter · ce9c2943
      Stefy Lanza (nextime / spora ) authored
      - ds4: configurable janitor that age-prunes the on-disk KV-checkpoint cache
        (--kv-disk-dir, defaulted to <offload>/ds4-kv). New Ds4Config fields
        kv_cache_cleanup_enabled / kv_cache_max_age_hours (7d) /
        kv_cache_cleanup_interval_minutes (6h); new codai/api/ds4_kv_janitor.py
        reuses the tmp_janitor sweep (newest-mtime, so active sessions are spared),
        started from main.py only when ds4 + cleanup are both on. Settings UI +
        get/save wired.
      - ds4: corrected the perf note — i-quants (IQ2/IQ3) and Q2_K load but fail
        ds4's CUDA prefill (gpu layer 0 ffn batch encode failed → empty reply);
        use K-quants Q4_K and up.
      - models: pressing Enter in the HuggingFace search field now runs the search.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      ce9c2943
    • Stefy Lanza (nextime / spora )'s avatar
      front: configurable compaction model + live progress to the client · 1ad76dfb
      Stefy Lanza (nextime / spora ) authored
      Auto-compaction can now summarize with a DIFFERENT model than the one
      serving the request, with a global default (config.json `compaction`) and
      a per-model override (models.json `auto_compact_model`). Empty = the
      request's own model, as before.
      
      - config: new CompactionConfig (enabled/pct/strategy/model) + round-trip
      - text.py: resolve effective settings (per-model over global), resolve the
        summarizer LAZILY (only when actually over threshold, so a separate model
        isn't loaded on every request); map-reduce the dropped history into chunks
        sized to the CHOSEN summarizer's own context, reducing iteratively until
        it fits; stream status + live per-chunk progress to the client as content
        deltas (queue-bridged from the summarizer's callback)
      - admin: global compaction card (settings) + per-model summarizer dropdown
        (models, shown only for the summarize strategy)
      
      Raw two-pass path is skipped (prompt is built from system + last user turn).
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      1ad76dfb
    • Stefy Lanza (nextime / spora )'s avatar
      ui: document ds4 performance setup in global + per-model settings · 2b2bdc72
      Stefy Lanza (nextime / spora ) authored
      Add a "tuning ds4 for performance" note to the global ds4 Settings section and a
      condensed inline note to the per-model ds4 streaming section on the Models page:
      NVMe/SSD placement (≈10× prefill), capping the expert cache by count, sizing the
      VRAM reserve, DS4_CUDA_WEIGHT_ARENA_CHUNK_MB=512, avoiding DS4_CUDA_WEIGHT_CACHE,
      and that decode of a model larger than VRAM is streaming-bound (smaller quant is
      the real fix). Distilled from measured tuning on the 154GB DeepSeek-V4 MoE.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      2b2bdc72
    • Stefy Lanza (nextime / spora )'s avatar
      text: stop DeepSeek V4 DSML tool markup leaking during streaming · 6e20c8a3
      Stefy Lanza (nextime / spora ) authored
      The streaming tool-content gate withheld <tool>/<|tool_call>/call: markers but
      not DeepSeek V4's native <|DSML|tool_calls>… block (| = U+FF5C), so during a
      streamed tool call the raw markup reached the client token-by-token as visible
      content (even though the post-stream parser extracted the tool_calls correctly).
      
      _gate_tool_content now withholds everything from the first <|DSML| marker to the
      end (dropped on final, surfaced as structured tool_calls), and the trailing-
      partial hold list includes the DSML open tag so a marker split across chunks
      doesn't leak its leading chars.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      6e20c8a3
    • Stefy Lanza (nextime / spora )'s avatar
      ds4: per-model launch overrides, multibyte-safe streaming, UI tweaks · 6a111627
      Stefy Lanza (nextime / spora ) authored
      Per-model ds4 tuning (these vary by quant/size/context, so they belong on the
      model, not globally):
      - Optional `ds4` block on a model entry overrides the global Ds4Config for
        ssd_streaming / expert_cache_reserve_gb / extra_args / extra_env; unset fields
        inherit the global config (the default/template). Ds4Backend looks up its own
        model entry and applies the overrides via dataclasses.replace.
      - admin: api_model_configure accepts + normalizes the per-model `ds4` block,
        dropping it when empty.
      - models page: a "ds4 streaming" section shown only when ds4 is enabled globally
        and the model is a deepseek4; n_ctx stays the context knob.
      
      Fix garbled / truncated ds4 replies: the streaming reader used
      iter_lines(decode_unicode=True), which decodes each network chunk independently
      and corrupts a multibyte UTF-8 char split across chunks ('—' -> 'â'); the broken
      JSON then made json.loads fail and the token was silently dropped (truncated
      tails). Parse the SSE byte stream and split on the b"\n" byte (never inside a
      UTF-8 sequence), decoding whole lines; also flush a final newline-less line.
      
      UI: slow-reply notice reworded to "Waiting for model reply..." with a trailing
      newline so the real reply starts on its own line.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      6a111627
    • Stefy Lanza (nextime / spora )'s avatar
      parser: handle DeepSeek V4 DSML tool calls; reword waiting message · 7f39ce8f
      Stefy Lanza (nextime / spora ) authored
      DeepSeek V4 (ds4) emits native tool calls as <|DSML|invoke name="…">
      <|DSML|parameter name="p" string="true">val</|DSML|parameter></|DSML|invoke>
      (the | is U+FF5C). No parser recognised this, so ToolCallParser returned None
      and the raw markup leaked to the client as content even though ds4 reported
      finish=tool_calls.
      
      - parse_deepseek_dsml_tool_calls(): extract (name, args); string="false" params
        are JSON-decoded, others kept as strings; ASCII | tolerated.
      - Wired into DeepSeekParser and ToolCallParser.extract_tool_calls (the live path).
      - strip_dsml_tool_calls(): drop the DSML block from displayed content in both
        strip_tool_calls_from_content paths. Guarded by 'DSML' in text -> no effect on
        other models.
      
      Also reword the slow-reply notice from "Waiting for model to load..." to
      "Waiting for model reply..." (the model is usually loaded, just slow).
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      7f39ce8f
    • Stefy Lanza (nextime / spora )'s avatar
      ds4: configurable CUDA env knobs (expert-cache reserve + free-form extra_env) · 7fc393d4
      Stefy Lanza (nextime / spora ) authored
      ds4-server exposes several CUDA tunables only via environment, not CLI flags.
      By default ds4 reserves half the card for non-cache use and allocates the model
      weight arena in 1792 MiB chunks — both starve / OOM the streaming expert cache
      on small-weight MoE models served from SSD.
      
      Pass an explicit env to ds4-server (Popen now sets env=) with:
        - expert_cache_reserve_gb: typed knob -> DS4_CUDA_STREAMING_EXPERT_CACHE_RESERVE_GB
          (0 = leave ds4's default).
        - extra_env: free-form KEY=VALUE passthrough for the rest, e.g.
          DS4_CUDA_WEIGHT_ARENA_CHUNK_MB=512 to shrink the weight-arena chunk so it
          fits a heap fragmented by the expert cache.
      
      Both surfaced in Settings (config + admin GET/POST + UI), default to no-op so
      behaviour is unchanged unless set.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      7fc393d4
    • Stefy Lanza (nextime / spora )'s avatar
      packaging: ship a markdown README.md in the docker dist bundle · bb4cd8db
      Stefy Lanza (nextime / spora ) authored
      Add a GitHub-flavored README.md alongside the existing README.txt in the
      all-in-one docker distribution bundle, and have make_dist_bundle.sh stage it
      so future builds include both.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      bb4cd8db
    • Stefy Lanza (nextime / spora )'s avatar
      front/ui: stricter mmproj auto-pair; show loaded model ids, not aliases · 597f8b83
      Stefy Lanza (nextime / spora ) authored
      - models page: the multimodal-projector select now defaults to None unless a
        projector is a strong, unambiguous name match. Scores only distinctive tokens
        (drops generic words + quant tokens, keeps size tokens like 14b), requires
        covering at least half the model's tokens, and rejects ties. Stops a lone
        shared family token from pairing the wrong-size projector.
      - task page: the per-engine loaded-model hover now lists each model once by its
        canonical id instead of its aliases (auto gguf stem, explicit alias, type
        prefix). engines_list() resolves loaded keys via the pin index's new model_id.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      597f8b83
  2. 19 Jun, 2026 21 commits
    • Stefy Lanza (nextime / spora )'s avatar
      front: per-model engine_fallback option for an unavailable pin · 84d085d7
      Stefy Lanza (nextime / spora ) authored
      By default a per-model engine pin is a hard constraint: if the pinned
      engine is down/incompatible the request fails (no duplicate on another
      card). Add an `engine_fallback` model-config flag (admin form checkbox +
      persisted in models.json) that opts into the old behaviour — fall back to
      a compatible engine when the pin can't be honoured. A pinned engine
      that's merely busy-but-alive is still routed to (queues) in both modes;
      fallback only applies when it's actually down or can't serve the model.
      
      Threads pin_fallback through pick_engine; the front reads engine_fallback
      via _load_pins/_model_info.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      84d085d7
    • Stefy Lanza (nextime / spora )'s avatar
      front: route to a pinned/owning engine that's busy, not a duplicate elsewhere · caa051b4
      Stefy Lanza (nextime / spora ) authored
      An engine mid-generation is GIL-blocked and fails the health poll, so it
      reads as unhealthy. pick_engine required e.healthy at every step, so a
      second request for a model pinned to that engine fell through to the
      least-loaded engine — which loaded a DUPLICATE copy (and ignored the
      model's configured n_ctx, e.g. 2048 vs 32000 → "exceeds context window").
      
      Honour the pin (and the assigned owner) when the engine is alive but
      transiently busy: route there so the request queues on its gen-lock and
      the owner handles serialization/eviction. Only fall back to another engine
      when the owner's process is actually dead. Adds Engine.is_alive() (process
      liveness) and registry.engine_owning() (health-agnostic owner lookup).
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      caa051b4
    • Stefy Lanza (nextime / spora )'s avatar
      parser: stop loose JSON parser from hanging on malformed tool args · cf50ab84
      Stefy Lanza (nextime / spora ) authored
      The gemma loose object/array parser could spin forever: when a recursive
      value parse can't advance past a stray delimiter (e.g. '}' where ']' was
      expected, as in the broken `{"files":[{"path":"x"]}}` a looping Gemma
      finetune emits), the array/object loop kept iterating without consuming
      input. parse_gemma_native_tool_calls and the new parse_tool_tag_json_calls
      both feed model output through this parser, so a malformed tool call would
      hang the request (not just be missed). Add a forward-progress guard to
      both loops: bail when an iteration consumes no input. Best-effort recovers
      the tool name + good fields from malformed JSON; clean input is unaffected.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      cf50ab84
    • Stefy Lanza (nextime / spora )'s avatar
      front: keep an engine's models listed while it's mid-load · 1c57629b
      Stefy Lanza (nextime / spora ) authored
      collect_models unioned only registry.healthy() engines and re-fetched
      each engine's /v1/models live. An engine loading a model is GIL-blocked
      and misses the 2s health poll, so it goes "unhealthy" and ALL its models
      — including a freshly-added one — drop out of the aggregated /v1/models
      until the load finishes. A client (e.g. the kilo model script) polling
      during a load then sees models vanish. Cache each engine's last-good
      /v1/models and, when it's transiently unhealthy/unreachable, serve that
      cached list instead of dropping it. The models are still assigned to the
      engine and will serve once it's free.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      1c57629b
    • Stefy Lanza (nextime / spora )'s avatar
      parser: detect <tool>{json}<tool> tool calls (Gemma finetunes) · fc828989
      Stefy Lanza (nextime / spora ) authored
      Some Gemma finetunes emit tool calls as <tool>{"name":..,"arguments":..}
      using <tool> (or <tool_call>) as BOTH delimiters — no closing slash —
      and sometimes append a stray quote. Every existing <tool>…</tool>
      pattern requires a slashed closer, so GemmaParser found nothing and the
      call was returned as plain text: kilocode saw no tool_call and showed no
      reply. Add parse_tool_tag_json_calls, which extracts the brace-balanced
      object after each marker via the tolerant loose parser (so trailing junk
      and stray quotes don't break it) and reads name + arguments/parameters,
      restricted to declared tool names and de-duplicated. Wired into
      GemmaParser before the generic fallback.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      fc828989
    • Stefy Lanza (nextime / spora )'s avatar
      vulkan: fold system role by template signal, not just architecture · 64eb74b7
      Stefy Lanza (nextime / spora ) authored
      Whether a model rejects the 'system' role is a property of the chat
      template baked into the specific GGUF, not the architecture: the gemma-2
      template and the official gemma template raise "System role not
      supported", while 'heretic' gemma4 quant conversions ship a permissive
      template that accepts system. Detect from the embedded
      tokenizer.chat_template (raise_exception/"system role") and fold only
      when it actually rejects system; fall back to architecture (Gemma) when
      no template is readable. Avoids needlessly folding permissive Gemma
      models while still covering gemma-2-9b and strict non-Gemma templates.
      The runtime "System role not supported" retry remains as a safety net.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      64eb74b7
    • Stefy Lanza (nextime / spora )'s avatar
      vulkan: fold system message into user turn when template rejects it · 39a62745
      Stefy Lanza (nextime / spora ) authored
      Gemma's chat template has no 'system' role; llama.cpp raises "System
      role not supported" and the generation fails (the Kilo client always
      sends a system prompt). On that specific error, retry with the system
      message(s) folded into the first user turn — Gemma's own convention,
      and a no-op for models that accept system. Handles both streaming and
      non-streaming paths and preserves multimodal (list) content.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      39a62745
    • Stefy Lanza (nextime / spora )'s avatar
      vulkan: resolve bare-alias gguf locally before trying HuggingFace · eb138bfa
      Stefy Lanza (nextime / spora ) authored
      VulkanBackend.load_model treated any id that isn't a '.gguf' path / file
      / URL as a HuggingFace repo to download. A configured gguf addressed by
      its automatic alias ('coe-…-q4_k_m', no extension) thus 404'd against
      the Hub instead of loading the local file. Resolve the alias via
      _resolve_local_gguf (configured-entry + cache-dir match) first; only fall
      back to the HF path when no local gguf is found.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      eb138bfa
    • Stefy Lanza (nextime / spora )'s avatar
      front: route gguf bare alias by capability to its real engine, not nvidia · 482e47cf
      Stefy Lanza (nextime / spora ) authored
      pick_engine honours the front's assignment (radeon) only if the engine
      can_serve the request's required capability. But _required_cap derived
      that capability from the bare alias 'coe-…-q4_k_m' — no literal 'gguf' —
      so required_capability returned 'transformers' (CUDA-only). radeon is
      gguf-only, failed can_serve, and the request fell through to the default
      engine (nvidia), even though compute_assignment had correctly placed the
      model on radeon (it sees the full '…-q4_k_m.gguf' path).
      
      Resolve the model's configured path in _load_pins (now indexed by the
      .gguf-stripped stem too) and, when the name heuristic yields
      'transformers' but that path is a .gguf, correct the capability to
      'gguf'. whisper/ds4 precedence is unchanged. Combined with the registry
      stem-matching, a bare-alias request now lands on the owning Vulkan/AMD
      engine.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      482e47cf
    • Stefy Lanza (nextime / spora )'s avatar
      vulkan: read free VRAM from amdgpu sysfs (CUDA query is NVIDIA-only) · f204f399
      Stefy Lanza (nextime / spora ) authored
      _free_vram_gb() used torch.cuda.mem_get_info, which returns 0 on an
      AMD/Vulkan engine (no CUDA). That made the auto-offload sizing guard
      (_free > 0) silently false, so n_gpu_layers stayed at -1 (all) and a
      model larger than VRAM was forced entirely onto the GPU — OOM, "Failed
      to load model from file" (e.g. a 13 GB gemma4 model on an 8 GB RX 580).
      A 24 GB CUDA card has room for all layers, so the bug was invisible
      there. Fall back to amdgpu sysfs (mem_info_vram_total - vram_used,
      indexed by Vulkan device order) so AMD GPUs size partial offload too.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      f204f399
    • Stefy Lanza (nextime / spora )'s avatar
      manager: classify gguf by resolving alias to a local .gguf file · 269824b2
      Stefy Lanza (nextime / spora ) authored
      load_model() decided gguf-vs-HF purely from the literal string 'gguf' in
      the model name. A gguf whose alias carries only the quant suffix (e.g.
      'coe-gemma4-coding-hc-14b-a4b-q4_k_m', no literal 'gguf') was mis-routed
      to the HF/transformers backend, which then failed with "is not a valid
      model identifier" (503). Fall back to _resolve_local_gguf(): if the alias
      maps to an actual local .gguf, treat it as gguf and route to llama.cpp.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      269824b2
    • Stefy Lanza (nextime / spora )'s avatar
      multi-engine: route gguf automatic alias (filename without .gguf) · 2eda7574
      Stefy Lanza (nextime / spora ) authored
      A gguf model's assigned/loaded key is its file path, but /v1/models
      advertises it — and clients address it — by the filename without the
      .gguf suffix (the automatic alias). engine_for_assigned /
      engine_for_model / _key_matches_path compared short names verbatim, so
      the automatic alias never matched the .gguf key and routing fell through
      (404 / wrong engine). Normalize both sides via _short_stem so the
      automatic alias resolves to the owning engine with no manual alias.
      Co-Authored-By: 's avatarClaude Opus 4.8 <noreply@anthropic.com>
      2eda7574
    • Stefy Lanza (nextime / spora )'s avatar
      multi-engine: live /v1/models on config change + accept gguf-stem ids · 79c2e44d
      Stefy Lanza (nextime / spora ) authored
      Two bugs made a freshly-configured model unusable until a full restart on a
      multi-engine node:
      
      1. Name mismatch: list_models advertises a gguf's filename WITHOUT .gguf as an
         id, but get_all_allowed_identifiers only allowed the name WITH .gguf, so a
         request using the id from /v1/models was 404'd as "not an allowed model".
         Now the .gguf-stripped stem is allowed too.
      
      2. Stale per-engine assignment: each engine's /v1/models is filtered by the
         assignment set fixed at startup, and secondary engines never re-read
         models.json — so an added/removed model didn't show up or route until
         restart. The front now watches models.json mtime, recomputes the
         assignment, updates its router, and pushes it to every engine via a new
         internal POST /internal/reload-config (re-reads models.json +
         set_assigned_models). /v1/models and routing now reflect add/remove live.
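The mtime watch in fix 2 could look roughly like this. It is a sketch, not the project's code: `ConfigWatcher` and its `on_change` callback are hypothetical, and the callback is where the front would recompute the assignment and POST to each engine's /internal/reload-config.

```python
import os
from typing import Callable, Optional


class ConfigWatcher:
    """Poll a file's mtime and fire a callback when it changes."""

    def __init__(self, path: str, on_change: Callable[[], None]) -> None:
        self.path = path
        self.on_change = on_change
        self._last_mtime: Optional[float] = self._mtime()

    def _mtime(self) -> Optional[float]:
        # A missing file is reported as None instead of raising.
        try:
            return os.stat(self.path).st_mtime
        except FileNotFoundError:
            return None

    def poll(self) -> bool:
        """Return True (and call on_change) if the file changed since last poll."""
        current = self._mtime()
        if current == self._last_mtime:
            return False
        self._last_mtime = current
        self.on_change()
        return True
```

Calling `poll()` from a periodic task keeps the check cheap: one `stat` per tick, with the reload work done only on an actual change.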
      Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    • admin: strict 1:1 whisper model<->runner linked by config alias == runner id · f0dcf7eb
      Stefy Lanza (nextime / spora ) authored
      The whisper-server runner's model_id is inherited from the gguf MODEL config's
      alias, which links the two. So:
      - Adding a model config creates one runner whose id is the config's alias
        (auto-minted + stamped onto the config when no alias is given).
      - Removing a config (by config_id or by path) tears down the runner whose id
        matches that config's alias — one config removed = one runner removed/killed.
      
      Replaces the interim model_config_id link.
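A toy model of the 1:1 link, assuming in-memory dicts in place of the real config store and subprocess handling. `WhisperRegistry` and the minted alias format are hypothetical.

```python
import uuid
from typing import Optional


class WhisperRegistry:
    """Keep one runner per model config, keyed by the config's alias."""

    def __init__(self) -> None:
        self.configs: dict[str, dict] = {}  # config_id -> model config
        self.runners: dict[str, dict] = {}  # runner id (== alias) -> runner

    def add_config(self, config_id: str, path: str, alias: Optional[str] = None) -> str:
        # Mint an alias when none is given and stamp it onto the config.
        alias = alias or f"whisper-{uuid.uuid4().hex[:8]}"
        self.configs[config_id] = {"path": path, "alias": alias}
        self.runners[alias] = {"model_path": path}
        return alias

    def remove_config(self, config_id: str) -> bool:
        # Removing a config tears down exactly the runner with the same alias.
        cfg = self.configs.pop(config_id, None)
        if cfg is None:
            return False
        self.runners.pop(cfg["alias"], None)
        return True
```

Because the runner id is the alias itself, no separate link field is needed to find the runner that belongs to a config.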
      Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    • whisper: account a running runner as a loaded model for VRAM eviction · 2a214215
      Stefy Lanza (nextime / spora ) authored
      Starting a whisper-server runner loads the gguf onto the GPU, but it was
      invisible to the VRAM-eviction logic — it never evicted others to make room,
      recorded no footprint, and (lacking a cleanup()) couldn't itself be evicted.
      
      - WhisperServerManager.cleanup() -> stop(), so _evict_one/unload_model can
        free its VRAM like any other model.
      - MultiModelManager.start_whisper_server(): estimate the gguf footprint, evict
        other models if free VRAM is short, start the subprocess, and register it in
        models/models_in_vram/_measured_vram_gb (active_in_vram). It's now both a
        trigger for eviction and an eviction candidate.
      - stop_whisper_server(): stop + clear all that accounting (frees VRAM).
      - Routed every start/stop through these: on-request transcription, engine
        startup pre-load, admin model-load (Load button) and model-unload/disable.
      
      So: starting a runner = a model load (evicts as needed); unloading = frees VRAM.
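The accounting idea can be sketched as follows. `VramAccounting` is hypothetical, the footprints are caller-supplied estimates, and oldest-loaded-first eviction is an assumption made for the example rather than the policy MultiModelManager actually uses.

```python
class VramAccounting:
    """Track per-model VRAM footprints and evict to make room for a new load."""

    def __init__(self, total_gb: float) -> None:
        self.total_gb = total_gb
        self.loaded: dict[str, float] = {}  # insertion order doubles as load order

    def free_gb(self) -> float:
        return self.total_gb - sum(self.loaded.values())

    def load(self, model_id: str, footprint_gb: float) -> list[str]:
        """Register a model, evicting others first if free VRAM is short."""
        if footprint_gb > self.total_gb:
            raise MemoryError(f"{model_id} needs more VRAM than the device has")
        self.loaded.pop(model_id, None)  # a reload replaces the old record
        evicted: list[str] = []
        while self.free_gb() < footprint_gb and self.loaded:
            victim = next(iter(self.loaded))  # assumed policy: oldest load first
            del self.loaded[victim]
            evicted.append(victim)
        self.loaded[model_id] = footprint_gb
        return evicted

    def unload(self, model_id: str) -> None:
        """Drop a model's record, which frees its accounted VRAM."""
        self.loaded.pop(model_id, None)
```

In this model a whisper runner is just another entry in `loaded`, which is what makes it both a trigger for eviction and a candidate for it.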
      Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    • admin: whisper gguf model auto-manages its runner (1:1) · 3d551444
      Stefy Lanza (nextime / spora ) authored
      Model a whisper gguf as two things: a MODEL config (a .gguf entry with
      backend=whisper-server and NO model_path — enables the model, holds load
      strategy, shown on the GGUF row) and a RUNNER (backend=whisper-server WITH
      model_path — the subprocess, shown in the whisper card).
      
      - Enabling a .gguf with speech_to_text marks it backend=whisper-server and
        auto-creates exactly one runner (1:1) on a free port.
      - Disabling the model removes + kills all its runners (cascade by model_path).
      - Removing a runner (or model) now stops the subprocess + drops registry
        entries, instead of leaving it running until restart.
      - cached-models shows the model config on the GGUF row but excludes runners;
        the whisper card shows only runners (require model_path).
      - engine startup only launches runners (entries with model_path), never the
        bare model config.
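The model-config versus runner distinction comes down to whether `model_path` is set. A small sketch, assuming entries are plain dicts with `backend` and `model_path` keys (`split_whisper_entries` is a hypothetical helper):

```python
def split_whisper_entries(entries: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split whisper-server entries into (model configs, runners)."""
    model_configs: list[dict] = []
    runners: list[dict] = []
    for entry in entries:
        if entry.get("backend") != "whisper-server":
            continue  # not a whisper entry at all
        if entry.get("model_path"):
            runners.append(entry)        # has model_path: the subprocess
        else:
            model_configs.append(entry)  # no model_path: the bare model config
    return model_configs, runners
```

Engine startup would then launch only the second list, and the GGUF row would show only the first.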
      Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    • admin: keep GGUF model config and whisper-server runner config separate · 41e3661e
      Stefy Lanza (nextime / spora ) authored
      A GGUF model config (configure + enable the model, load strategy) and a
      whisper-server config (the runner: port, gpu, which model) are two distinct
      things. Showing whisper-servers as the backing file's configs made the GGUF
      row's "Configure" open the whisper form — conflating them.
      
      Whisper-server entries are again excluded from a GGUF file's editable config
      list (they live only in their own card); the GGUF row's Configure opens the
      general model config modal. The file still reflects "loaded" via its
      model_path in the loaded-status sets.
      Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    • admin: show whisper-servers as gguf configs again, edit via whisper editor · 615967d8
      Stefy Lanza (nextime / spora ) authored
      Reverted the over-correction that hid whisper-server entries from the
      backing GGUF file's config list — they should appear there (the file is
      configured through them). To avoid the duplicate-config bug, editing a
      whisper-server config from the GGUF row now routes to the whisper editor
      (which updates in place by id) instead of the general config modal. Pills
      for whisper-server configs are labelled by id so the two instances are
      distinguishable (they share the "whisper" alias).
      Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    • admin: fix duplicate gguf configs from whisper-server pollution · ea8dd92c
      Stefy Lanza (nextime / spora ) authored
      Editing a GGUF model's config kept appending duplicate entries. Root cause:
      api_cached_models added whisper-server entries to the backing GGUF file's
      config list (keyed by model_path). With whisper0/whisper1 both pointing at
      the file, the GGUF row's configs[0] became a whisper-server entry, which
      carries no config_id — so "Configure" on that row treated every save as a
      brand-new config and spawned a fresh duplicate each time.
      
      - cached-models: skip whisper-server entries entirely (they're managed in
        their own card; the file still shows "loaded" via its model_path key).
      - model-configure (whisper-server): update an existing entry in place when
        the id matches instead of 409-or-append, preserving unmanaged fields
        (engine, config_id).
      - model-disable: guard against whisper-server entries' path=None so a
        path-based disable can't crash on basename(None).
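Two of the fixes above, update-in-place by id and the None-path guard, can be sketched like this. `upsert_by_id` and `safe_basename` are hypothetical helpers, and entries are assumed to be plain dicts.

```python
import os
from typing import Optional


def upsert_by_id(entries: list[dict], incoming: dict) -> dict:
    """Update the entry with a matching id in place, else append a new one."""
    incoming_id = incoming.get("id")
    if incoming_id is not None:
        for entry in entries:
            if entry.get("id") == incoming_id:
                # update() leaves keys absent from `incoming` untouched,
                # so unmanaged fields such as engine and config_id survive.
                entry.update(incoming)
                return entry
    entries.append(dict(incoming))
    return entries[-1]


def safe_basename(path: Optional[str]) -> Optional[str]:
    """basename() that tolerates entries whose path is None."""
    return os.path.basename(path) if path else None
```

Matching on id before appending is what stops each save from spawning another duplicate.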
      Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    • admin: stop the models page jumping to the bottom on refresh · d7636907
      Stefy Lanza (nextime / spora ) authored
      Clicking Load/Unload (or any refreshLocal) re-ran loadCachedModels, which
      blanked the HF/GGUF lists to "Loading…" every time. That collapsed the page
      height and threw the viewport to the end, so a whisper-server unload looked
      like it "did nothing and scrolled to the bottom" even though it worked.
      
      Now the "Loading…" placeholder only shows on the first (empty) load; on a
      refresh the existing rows stay in place and the scroll position is captured
      and restored around the re-render.
      Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
    • admin: whisper-server load mode derives from the backing GGUF config · b717f1dc
      Stefy Lanza (nextime / spora ) authored
      Removed the redundant "Load mode" dropdown from both whisper-server forms
      (builder + edit modal). A whisper-server is backed by a GGUF model, so its
      load mode now derives from that GGUF's configured load_mode (default
      on-request) instead of being set separately in two places. Load/offload
      strategy stays solely in the GGUF model config.
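A sketch of the derivation, assuming the runner is matched to its GGUF config by file path and that configs are plain dicts. `runner_load_mode` and the `path` key are hypothetical; only `load_mode` and the on-request default come from the message above.

```python
def runner_load_mode(
    runner: dict,
    gguf_configs: list[dict],
    default: str = "on-request",
) -> str:
    """Derive a whisper-server's load mode from its backing GGUF config."""
    for cfg in gguf_configs:
        # Assumed link: the runner's model_path equals the GGUF config's path.
        if cfg.get("path") == runner.get("model_path"):
            return cfg.get("load_mode") or default
    return default
```

With the mode computed on demand, there is no second stored copy on the runner that could drift from the GGUF config.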
      Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>