- 19 Jun, 2026 35 commits
-
-
Stefy Lanza (nextime / spora ) authored
Whether a model rejects the 'system' role is a property of the chat template baked into the specific GGUF, not the architecture: the gemma-2 template and the official gemma template raise "System role not supported", while 'heretic' gemma4 quant conversions ship a permissive template that accepts system. Detect from the embedded tokenizer.chat_template (raise_exception/"system role") and fold only when it actually rejects system; fall back to architecture (Gemma) when no template is readable. Avoids needlessly folding permissive Gemma models while still covering gemma-2-9b and strict non-Gemma templates. The runtime "System role not supported" retry remains as a safety net. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
Gemma's chat template has no 'system' role; llama.cpp raises "System role not supported" and the generation fails (the Kilo client always sends a system prompt). On that specific error, retry with the system message(s) folded into the first user turn — Gemma's own convention, and a no-op for models that accept system. Handles both streaming and non-streaming paths and preserves multimodal (list) content. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
VulkanBackend.load_model treated any id that isn't a '.gguf' path / file / URL as a HuggingFace repo to download. A configured gguf addressed by its automatic alias ('coe-…-q4_k_m', no extension) thus 404'd against the Hub instead of loading the local file. Resolve the alias via _resolve_local_gguf (configured-entry + cache-dir match) first; only fall back to the HF path when no local gguf is found. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
pick_engine honours the front's assignment (radeon) only if the engine can_serve the request's required capability. But _required_cap derived that capability from the bare alias 'coe-…-q4_k_m' — no literal 'gguf' — so required_capability returned 'transformers' (CUDA-only). radeon is gguf-only, failed can_serve, and the request fell through to the default engine (nvidia), even though compute_assignment had correctly placed the model on radeon (it sees the full '…-q4_k_m.gguf' path). Resolve the model's configured path in _load_pins (now indexed by the .gguf-stripped stem too) and, when the name heuristic yields 'transformers' but that path is a .gguf, correct the capability to 'gguf'. whisper/ds4 precedence is unchanged. Combined with the registry stem-matching, a bare-alias request now lands on the owning Vulkan/AMD engine. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
_free_vram_gb() used torch.cuda.mem_get_info, which returns 0 on an AMD/Vulkan engine (no CUDA). That made the auto-offload sizing guard (_free > 0) silently false, so n_gpu_layers stayed at -1 (all) and a model larger than VRAM was forced entirely onto the GPU — OOM, "Failed to load model from file" (e.g. a 13 GB gemma4 model on an 8 GB RX 580). A 24 GB CUDA card has room for all layers, so the bug was invisible there. Fall back to amdgpu sysfs (mem_info_vram_total - vram_used, indexed by Vulkan device order) so AMD GPUs size partial offload too. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
load_model() decided gguf-vs-HF purely from the literal string 'gguf' in the model name. A gguf whose alias carries only the quant suffix (e.g. 'coe-gemma4-coding-hc-14b-a4b-q4_k_m', no literal 'gguf') was mis-routed to the HF/transformers backend, which then failed with "is not a valid model identifier" (503). Fall back to _resolve_local_gguf(): if the alias maps to an actual local .gguf, treat it as gguf and route to llama.cpp. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
A gguf model's assigned/loaded key is its file path, but /v1/models advertises it — and clients address it — by the filename without the .gguf suffix (the automatic alias). engine_for_assigned / engine_for_model / _key_matches_path compared short names verbatim, so the automatic alias never matched the .gguf key and routing fell through (404 / wrong engine). Normalize both sides via _short_stem so the automatic alias resolves to the owning engine with no manual alias. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
Two bugs made a freshly-configured model unusable until a full restart on a multi-engine node: 1. Name mismatch: list_models advertises a gguf's filename WITHOUT .gguf as an id, but get_all_allowed_identifiers only allowed the name WITH .gguf, so a request using the id from /v1/models was 404'd as "not an allowed model". Now the .gguf-stripped stem is allowed too. 2. Stale per-engine assignment: each engine's /v1/models is filtered by the assignment set fixed at startup, and secondary engines never re-read models.json — so an added/removed model didn't show up or route until restart. The front now watches models.json mtime, recomputes the assignment, updates its router, and pushes it to every engine via a new internal POST /internal/reload-config (re-reads models.json + set_assigned_models). /v1/models and routing now reflect add/remove live. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
The whisper-server runner's model_id is inherited from the gguf MODEL config's alias, which links the two. So: - Adding a model config creates one runner whose id is the config's alias (auto-minted + stamped onto the config when no alias is given). - Removing a config (by config_id or by path) tears down the runner whose id matches that config's alias — one config removed = one runner removed/killed. Replaces the interim model_config_id link. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
Starting a whisper-server runner loads the gguf onto the GPU, but it was invisible to the VRAM-eviction logic — it never evicted others to make room, recorded no footprint, and (lacking a cleanup()) couldn't itself be evicted. - WhisperServerManager.cleanup() -> stop(), so _evict_one/unload_model can free its VRAM like any other model. - MultiModelManager.start_whisper_server(): estimate the gguf footprint, evict other models if free VRAM is short, start the subprocess, and register it in models/models_in_vram/_measured_vram_gb (active_in_vram). It's now both a trigger for eviction and an eviction candidate. - stop_whisper_server(): stop + clear all that accounting (frees VRAM). - Routed every start/stop through these: on-request transcription, engine startup pre-load, admin model-load (Load button) and model-unload/disable. So: starting a runner = a model load (evicts as needed); unloading = frees VRAM. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
Model a whisper gguf as two things: a MODEL config (a .gguf entry with backend=whisper-server and NO model_path — enables the model, holds load strategy, shown on the GGUF row) and a RUNNER (backend=whisper-server WITH model_path — the subprocess, shown in the whisper card). - Enabling a .gguf with speech_to_text marks it backend=whisper-server and auto-creates exactly one runner (1:1) on a free port. - Disabling the model removes + kills all its runners (cascade by model_path). - Removing a runner (or model) now stops the subprocess + drops registry entries, instead of leaving it running until restart. - cached-models shows the model config on the GGUF row but excludes runners; the whisper card shows only runners (require model_path). - engine startup only launches runners (entries with model_path), never the bare model config. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
A GGUF model config (configure + enable the model, load strategy) and a whisper-server config (the runner: port, gpu, which model) are two distinct things. Showing whisper-servers as the backing file's configs made the GGUF row's "Configure" open the whisper form — conflating them. Whisper-server entries are again excluded from a GGUF file's editable config list (they live only in their own card); the GGUF row's Configure opens the general model config modal. The file still reflects "loaded" via its model_path in the loaded-status sets. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
Reverted the over-correction that hid whisper-server entries from the backing GGUF file's config list — they should appear there (the file is configured through them). To avoid the duplicate-config bug, editing a whisper-server config from the GGUF row now routes to the whisper editor (which updates in place by id) instead of the general config modal. Pills for whisper-server configs are labelled by id so the two instances are distinguishable (they share the "whisper" alias). Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
Editing a GGUF model's config kept appending duplicate entries. Root cause: api_cached_models added whisper-server entries to the backing GGUF file's config list (keyed by model_path). With whisper0/whisper1 both pointing at the file, the GGUF row's configs[0] became a whisper-server entry, which carries no config_id — so "Configure" on that row treated every save as a brand-new config and spawned a fresh duplicate each time. - cached-models: skip whisper-server entries entirely (they're managed in their own card; the file still shows "loaded" via its model_path key). - model-configure (whisper-server): update an existing entry in place when the id matches instead of 409-or-append, preserving unmanaged fields (engine, config_id). - model-disable: guard against whisper-server entries' path=None so a path-based disable can't crash on basename(None). Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
Clicking Load/Unload (or any refreshLocal) re-ran loadCachedModels, which blanked the HF/GGUF lists to "Loading…" every time. That collapsed the page height and threw the viewport to the end, so a whisper-server unload looked like it "did nothing and scrolled to the bottom" even though it worked. Now the "Loading…" placeholder only shows on the first (empty) load; on a refresh the existing rows stay in place and the scroll position is captured and restored around the re-render. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
Removed the redundant "Load mode" dropdown from both whisper-server forms (builder + edit modal). A whisper-server is backed by a GGUF model, so its load mode now derives from that GGUF's configured load_mode (default on-request) instead of being set separately in two places. Load/offload strategy stays solely in the GGUF model config. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
Two whisper-server UI issues on a multi-engine node: - Unload didn't visibly take effect: a whisper-server is registered both as a subprocess (whisper_servers) AND in the generic .models/.model_pools registry under audio:<id>. The unload stopped the subprocess but left the registry entry, so the row kept showing "Unload". Now it also drops the audio:<id>/<id> registry entries (and matches by id, audio:<id>, or the gguf model_path, so unloading the file stops every server using it). - The backing gguf file showed "Load" while its whisper-server was running. Surface each running server's _model_path in the loaded-key sets (engine-state, model-loaded-status, status) so the GGUF-file row reflects that the file is in use. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
/admin/api/models is served by the primary engine and ran list_models() with the per-engine assignment filter, so models pinned to a secondary engine (whisper-servers configured with engine=radeon) were dropped. That emptied the whisper-server config form and left those models unmatched in the loaded-status check, so they never showed as loaded. Add list_models(all_engines=True) to bypass _entry_assigned, and call it from the admin model list. Combined with the whisper-servers now reported in /internal/engine-state, the radeon whisper-servers reappear in their own form and show as loaded. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
The supervisor polls /internal/engine-state (not /admin/api/status) to populate each engine's loaded_models in the registry, which the front then aggregates for the models page. That endpoint only read .models, so a whisper-server running on a secondary engine still showed as not loaded. Fold each running server (id + `audio:` alias) into its loaded_models. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
Two issues when unloading/reporting models on a multi-engine node: - Unload didn't free VRAM for pooled models. api_model_unload only popped multi_model_manager.models and never touched model_pools, so a model served with max_instances>1 (which lives only in the pool) kept all its instances resident. Now it searches both dicts and calls unload_model(), which cleans up the whole pool + runs gc/empty_cache. Also handles whisper-server models (their own subprocess) by stopping the server. - whisper-server showed as "not loaded". It runs as a subprocess tracked in whisper_servers, not in .models. Fold each running server (id + `audio:` alias) into both the model-loaded-status list and the /admin/api/status loaded_keys, so the models page, dashboard count and per-engine box all reflect it (incl. on a secondary engine). Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
model-load/model-unload were proxied to the primary engine, so unloading (or loading) a model that lives on a secondary engine hit the wrong process and silently no-op'd (was_loaded=False). Add front-proxy interceptors: - unload: find the engine whose loaded_models matches the path and forward the request there; fall back to the primary. - load: reuse an engine already serving the model, else the model's engine pin from models.json, else the primary. Registered before the catch-all proxy, mirroring /admin/api/engines. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
- Tasks page: the engine box "N models" count now has a hover tooltip listing every loaded model key on that engine (dotted underline + help cursor); "no models loaded" when empty. - Models page: models loaded on a non-primary engine were shown as idle. /admin/api/model-loaded-status is served by the primary engine and only reported its own pool. Added a front-proxy interceptor that proxies to the primary then unions in every other engine's loaded_models, mirroring the existing /admin/api/engines and /admin/api/status aggregation. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
The image's demo tool UIs (video editor :8420, videogen :7790, township :7788, parler TTS :8123) are now individually toggleable at container start — no rebuild. supervisord.conf drives each program's autostart from a CODERAI_TOOL_* env var; the entrypoint seeds defaults (three UIs on, parler off) so the %(ENV_...)s expansion always resolves. run_oci.sh (shipped as the dist coderai-docker runner) gains: --no-tools disable all three demo UIs --enable-tool NAME force-enable one (also turns on parler) --disable-tool NAME disable one; explicit toggles override --no-tools NAME ∈ {video-editor, videogen, township, parler}. README + --help updated. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
coderai disables hf_xet by default (it bypasses the tqdm progress hook and can segfault the worker), but some HF blobs are Xet-only and the plain HTTPS path refuses them with "file too large … install hf_xet" — even though hf_xet is bundled. The first pass now holds that error instead of surfacing it, detects the Xet-required message, and transparently retries with Xet enabled (force_xet → HF_HUB_DISABLE_XET=0). Non-Xet errors are surfaced as before; the existing crash→disable-Xet retry is unchanged. _attempt now returns the held error message as a 4th tuple element. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
When ds4.auto_download is enabled and a deepseek4 request resolves no local GGUF, the downloaded weight variant is now relocated into coderai's GGUF cache (get_model_cache_dir; move on same FS, symlink across devices) and registered in models.json as a text_models entry that mimics the requested ("failed") model's config — backend auto, on-request, enabled and visible (removed from unloaded/to_download). model_name is threaded ds4 backend → ensure_service → ensure_model so the registration mirrors the right entry. Also: settings "Extra ds4-server args" hint/placeholder updated to reflect the auto --kv-disk-dir and SSD-streaming expert-cache sizing (--ssd-streaming-cache-experts), noting Q2_K can fail ds4's CUDA prefill. Diagnosis (no code change): ds4-server's "cuda prefill failed" on the 93GB Q2_K variant is a quant-specific ds4 CUDA bug — the 154GB Q4_K completes prefill fine (verified: "prompt done 434s" vs Q2_K instant failure), with 15.8GB VRAM free either way (not OOM, not cache budget, not coderai). Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
ds4-server streams MoE experts and wants the whole GPU for its expert cache, but coderai's modest VRAM estimate for a ds4 model let it co-reside another model — starving the cache so ds4's layer-0 FFN expert encode failed ("gpu layer 0 ffn batch encode failed"). When loading a ds4 model on demand, unload all other models first so ds4-server gets the full GPU. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
- Always keep the CURRENT request (last message) intact and as the very last message after compaction (the compacted history/summary precedes it). - summarize strategy now CHUNKs the older history and summarizes map-reduce (per-chunk then a combined pass) so the summarization prompt can't itself overflow. - If compaction still can't fit the window (e.g. a single huge final message), return HTTP 400 "request too big for context" instead of failing mid-generation. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
When enabled for a model, if the prompt would exceed auto_compact_pct% of the model's context window, the conversation is shrunk to ~65% before generation instead of erroring on overflow. Per-model config (auto_compact / auto_compact_pct / auto_compact_strategy) with three strategies: - drop_oldest : keep system messages + the most recent turns that fit. - keep_head_tail : also keep the first user turn as an anchor + a count note. - summarize : replace the dropped middle with a best-effort LLM summary (generated by the loaded model; falls back to a count note). Token size is a cheap chars/4 estimate; membership uses object identity so value-equal turns don't collide. Wired into the chat path (codai/api/text.py), the model-configure whitelist, and the model config modal UI. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
- ds4: resolve a bare/aliased model id (e.g. "Foo-ds4-Q2_K", no path/extension) to its configured .gguf via a config/cache-aware resolver — fixes the 503 ("no local deepseek4 GGUF resolved") on chat requests (only "Load now" with a full path worked before). Ds4Backend reuses the same resolver. - ds4: report a modest VRAM footprint for ds4 models (measured or ~12GB) instead of the 100GB+ GGUF size — ds4-server streams experts from SSD and manages its own memory, so the old estimate forced needless ~128GB eviction churn every request. - ds4: route on-disk KV checkpoints into coderai's offload directory by default (--kv-disk-dir <offload>/ds4-kv) unless overridden in extra_args. - config: tolerant load (_dc drops unknown keys) so a stale/newer config.json never crashes the whole load and silently resets ALL settings to defaults (the "had to reconfigure everything" bug). save_config + GET/POST settings carry the new ds4 fields (model_path, auto_download, ssd_streaming). Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
- Route to ds4 by GGUF ARCHITECTURE (general.architecture == "deepseek4"), read from the file header (cached) — not by filename. Mainline deepseek/2/3/32 GGUFs stay on llama.cpp; the model_id alias still routes for the download case. - ds4-server now serves the REQUESTED GGUF: Ds4Backend resolves the model to a local .gguf and launches `ds4-server -m <file>` (resolve_service_key keys the managed service per file). No fixed-variant assumption. - Honour the model's per-entry n_ctx for ds4-server --ctx (over the global ctx). - New config.ds4 options + settings UI: ssd_streaming (--ssd-streaming, stream MoE experts from SSD/disk), model_path (explicit -m override), and auto_download (OFF by default — only serve GGUFs already present; error clearly instead of silently pulling tens of GB; opt in to fetch model_variant). - AI.PROMPT: document DeepSeek-V4 = pending upstream llama.cpp PRs (needs new ggml ops) → ds4 for now; and ds4 routing/offload/text-only specifics. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
- Streaming tool gate now withholds the gemma/qwen native `<|tool_call>` marker (and partials) too, not just `<tool_call>`/`call:NAME{` — so the raw marker no longer leaks to the client mid-stream (Kilo was executing partial calls). - Normalize tool-call function.arguments from JSON string → dict before applying the chat template, so templates that render `arguments|items` (Qwen) don't raise "Can only get item pairs from a mapping". - Context-window overflow now returns a meaningful error: a structured SSE error event (code context_length_exceeded) when streaming, or HTTP 400 with a clear message for non-streaming — instead of injecting "[Generation error: …]" as assistant content (which polluted chat history). - Models page: unconfigured GGUF files now expose the "Free disk" button (records them as "to download" before deleting), matching HF models. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
- api_model_load: load a GGUF/text model via llama.cpp even when it's also bucketed under image/vision (respect the entry's primary model_type), so a gemma+mmproj LLM never hits the diffusers from_pretrained() path. - model config save: a GGUF LLM with an mmproj auto-gets the image_to_text capability and is kept out of the diffusers vision_models/image_models buckets. - VRAM estimate: _runtime_reserve_gb scales the KV-cache reserve by the cache quantization (q4_0 ≈ 0.27× f16) so quantized-KV models at large context aren't over-estimated into needless CPU offload. - Free disk (HF): quiet huggingface_hub's noisy not-found traceback and make the delete idempotent (repo already gone = success). - Tasks page: generation tasks now report it/s (or s/it when slow); text keeps tok/s. Throughput computed centrally in the task registry (live EMA + run average on finish). New "Recent tasks (last 10)" history section. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
- make_dist_bundle.sh: assemble dist/coderai-docker-dist.tar containing the image tarball (docker save | pigz), the coderai-docker runner (run_oci.sh, image tag pinned), install.sh and README. Stages under dist/ (not tiny /tmp) and hardlinks the multi-GB image tarball instead of copying it. - dist-bundle/install.sh: docker-load the image (sudo-fallback for daemon access) then install the coderai-docker runner to /usr/local/bin (root) or ~/.local/usr/bin (user, added to ~/.bashrc PATH if missing). - build_oci_image.sh: after a successful build, export + bundle for distribution by default (--no-dist to skip). - run_oci.sh: default image tag -> coderai:dist (matches what's shipped/loaded). Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
Merge feat/township-match-upload: to-download list, mmproj vision, styled modals, broker + packaging
-
Stefy Lanza (nextime / spora ) authored
Web UI / models: - "To download" wishlist: models known but not on disk and not configured show as non-configured to-download rows. Free-disk on an unconfigured model, Remove on a model with no files left, and a new "Add to list" button in the download window all record into models.json `to_download`; pruned on enable/download. New endpoints model-mark-download / model-unmark-download. - mmproj multimodal components: mmproj GGUFs are classified as components (not models), selectable per-GGUF in the model config (auto-selected, enables vision capability). VulkanBackend loads them via llama.cpp's MTMDChatHandler (--mmproj equivalent), and the chat path now forwards image_url content end-to-end. - All window.alert() replaced by a shared styled showAlert()/showConfirm() modal in base.html (used across every admin template). Front proxy / broker: - Fix engine model-assignment NameError (keep -> _keep). - Brokered GET /coderai/capabilities now answers from the front (whole node) so multi-GPU hosts report every card, not a single engine's CUDA-visible one. - Log a clear reason when the broker is disabled. Packaging (distributable OCI image): - Multi-stage venv image + smoke test; bundle ds4/wav2lip/sadtalker + parler; whisper-server etc. dereferenced (cp -aL) so no dangling symlinks. - Dockerfile.update + update_oci_image.sh: ~30s incremental code-only rebuild on an immutable coderai:base (no 20GB bundle recopy). - run_oci.sh: --local/--config-dir + --map to run against existing local config and data dirs without a rebuild; --debug[=flags] + --log-file for selectable debug flags and a host-tailable file log (launcher tees; supervisord kills the process group). tmp_janitor age-prunes the dedicated temp dir. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
- 18 Jun, 2026 5 commits
-
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
GPTQModel silently leaves layers it can't map (e.g. gemma-4's fused batched MoE experts) in bf16, producing a near-full-size "checkpoint" that the loader would redirect to and then offload. The worker now scans the saved safetensors and, if <50% of large weight bytes are int-packed, deletes the output and marks the job failed (so it falls back to bitsandbytes) instead of reporting "done". Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
- quant jobs now appear on the Tasks page (api_tasks emits kind=quantize) and as a live badge on the HF model-list row (polled; re-renders only on change). - persist job state to <cache>/quantized/jobs.json; on startup a job left "running" is marked "interrupted" only if its owning PID is dead (merge-safe save so multiple processes don't clobber each other). - gitignore the runtime model cache (models/), logs/, and the third-party GPTQModel/ source clone (installed into the venv, not part of this repo). Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
- Prefix front/uvicorn and re-emitted engine log lines with [HH:MM:SS] so the front log format matches the engine ([HH:MM:SS][nvidia] …); preserve tqdm in-place progress and avoid double-timestamping already-tagged lines. - gpu_detect: _amd_gpu_name() resolves a card's marketing name via amdgpu product_name sysfs, then lspci board/chip name, then vulkaninfo. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-
Stefy Lanza (nextime / spora ) authored
Smart context caching (both text backends): - Per-instance generation lock so pooled concurrent requests can't corrupt a shared KV cache (GGUF + HF, incl. streaming worker thread). - GGUF: enable multi-slot LlamaRAMCache, budget via kv_cache_budget_mb (512MB). - HF: replace single exact-text KV slot with an LRU of token-prefix slots + token-level longest-common-prefix + DynamicCache clone/crop (handles mid-history edits); kv_cache_slots (default 3). - Session-affinity routing in ModelInstancePool.acquire(session_key); key from user/X-Session-Id else a stable prefix hash. - RAM-pressure ladder drops reclaimable prefix caches before evicting models. VRAM fix: - Auto-fit check no longer double-counts the KV/activation reserve when expected_vram_gb is already a peak estimate — borderline models (e.g. gemma-4-26B-A4B) stay GPU-resident instead of forced into MoE-thrashing device_map offload. GPTQ/AWQ fast-kernel quant backend (HF path): - New codai/models/quant.py: GPTQModel capability detection, quantized-checkpoint cache, on-demand background quantize job (falls back to bnb if unsupported). - quant_backend config (auto|bnb|gptq|awq); loader auto-uses a quantized checkpoint with Marlin/ExLlama when present, else bitsandbytes. - Admin endpoints + "Quantize to 4-bit" button with live status on the model page. - requirements-nvidia.txt documents the from-source install + numpy caveat. Co-Authored-By:Claude Opus 4.8 <noreply@anthropic.com>
-