- 20 Mar, 2026 9 commits
-
-
Your Name authored
- Add offload_strategy to kwargs in _load_default_model and _load_model_by_name
- Fix parameter name: ram -> manual_ram_gb to match backend expectation
- Also pass load_in_4bit, load_in_8bit, and max_gpu_percent
-
Your Name authored
- Add 'none' to --offload-strategy choices in cli.py
- In cuda.py backend:
  - _get_vram_percentages_for_strategy() returns None for 'none' strategy
  - _get_vram_percentages_for_gpu() skips VRAM detection for 'none'
  - load_model() loads directly on GPU without max_memory constraints
- Add startup status message in main.py for --offload-strategy none
-
Your Name authored
- Add --no-ram CLI option to force model loading without CPU RAM spilling
- Implement --no-ram behavior for:
  - llama-cpp-python: n_gpu_layers=-1, use_mmap=False, ignore --n-ctx
  - HuggingFace transformers: device_map='cuda:0', low_cpu_mem_usage=True
  - Diffusers: force full GPU loading
  - sd.cpp: maximize GPU usage
- Propagate flag through model manager
- Add startup banner message
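A minimal sketch of the per-backend --no-ram overrides listed above. The helper and its backend keys are hypothetical; the individual option names (`n_gpu_layers`, `use_mmap`, `device_map`, `low_cpu_mem_usage`) are the ones the commit message cites:

```python
def no_ram_overrides(backend: str) -> dict:
    """Per-backend option overrides for --no-ram (force full-GPU loading).

    A sketch of the policy, not the project's actual dispatch code.
    """
    if backend == "llama-cpp-python":
        # All layers on the GPU, no mmap (mmap would let pages live in
        # CPU RAM); a user-supplied --n-ctx is ignored in this mode.
        return {"n_gpu_layers": -1, "use_mmap": False}
    if backend == "transformers":
        # Pin everything to one CUDA device and avoid the CPU staging copy.
        return {"device_map": "cuda:0", "low_cpu_mem_usage": True}
    if backend in ("diffusers", "sd.cpp"):
        # Full GPU loading; no CPU offload hooks installed (assumed key).
        return {"cpu_offload": False}
    raise ValueError(f"unknown backend: {backend}")
```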
-
Your Name authored
- Add get_all_allowed_identifiers() to MultiModelManager returning all valid model identifiers (default model + short name + aliases, audio, tts, image, vision models, and custom aliases)
- Rewrite is_allowed_model() to check against the full allowed set with support for prefixed forms and short-name matching
- Add validation in request_model() that rejects unknown models with an error message listing all available models
- Fix get_model_for_request() to reject loading arbitrary models not in the allowed set
- Update all API endpoints (text, images, tts, transcriptions) to check for the error key and return HTTP 404 when a disallowed model is requested
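The allow-list logic above can be sketched like this. The class and method names mirror the commit message, but the identifier-expansion rules (prefixed form, short name) are simplified assumptions about how the real manager builds its set:

```python
class ModelRegistry:
    """Sketch of the allow-list validation described above."""

    def __init__(self, models):
        # models: mapping of role -> full model identifier,
        # e.g. {"default": "org/My-Model-GGUF", "tts": "org/tts-model"}
        self.models = models

    def get_all_allowed_identifiers(self):
        allowed = set()
        for role, name in self.models.items():
            allowed.add(name)                     # full identifier
            allowed.add(f"{role}:{name}")         # prefixed form (assumed syntax)
            allowed.add(name.rsplit("/", 1)[-1])  # short name
        return allowed

    def is_allowed_model(self, requested):
        return requested in self.get_all_allowed_identifiers()

    def request_model(self, requested):
        # Reject unknown models with an error listing every valid choice;
        # API endpoints turn this error key into an HTTP 404.
        if not self.is_allowed_model(requested):
            available = ", ".join(sorted(self.get_all_allowed_identifiers()))
            return {"error": f"unknown model '{requested}'; available: {available}"}
        return {"model": requested}
```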
-
Your Name authored
- Try GGUF pattern first for HuggingFace model IDs
- Fall back to snapshot_download for entire repo (transformers/diffusers models)
- Works for both GGUF models and full HuggingFace repos
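The GGUF-first control flow above amounts to a try/fallback. In the sketch below the two download functions are injected as callables standing in for `huggingface_hub`-style helpers (`hf_hub_download` with a GGUF pattern, `snapshot_download` for the whole repo); the exception type and signatures are illustrative, not the library's exact behavior:

```python
def resolve_hf_download(repo_id, try_gguf_download, snapshot_download):
    """GGUF-first download strategy: try to fetch a single GGUF file from
    the repo; if none exists, fall back to downloading the entire snapshot
    (transformers/diffusers layouts)."""
    try:
        # e.g. a hf_hub_download-style call restricted to '*.gguf'
        return try_gguf_download(repo_id)
    except FileNotFoundError:
        # No GGUF file in the repo: grab the full repository instead
        return snapshot_download(repo_id)
```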
-
Your Name authored
-
Your Name authored
- Remove auto-detection logic, just use download_model from cache
- User can specify --download-file-pattern for non-GGUF models
-
Your Name authored
- Scan HuggingFace repo to detect available file patterns
- Try multiple patterns (.gguf, .safetensors, .bin, .pt, .pth)
- Default to .gguf if nothing found
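The pattern detection above reduces to a preference-ordered scan of the repo's file listing. This sketch takes the listing as a plain list of names (in the real code it would come from a HuggingFace repo-files API call); the function name is illustrative:

```python
# Preference order for weight-file extensions when probing a repo listing.
PATTERNS = (".gguf", ".safetensors", ".bin", ".pt", ".pth")

def detect_file_pattern(repo_files):
    """Pick the first extension (in preference order) that actually occurs
    in the repo's file listing; default to .gguf if nothing matches."""
    for pattern in PATTERNS:
        if any(name.endswith(pattern) for name in repo_files):
            return pattern
    return ".gguf"
```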
-
Your Name authored
- Add --download-model argument to download a model (URL or HuggingFace ID) to cache
- Add --download-file-pattern argument to specify file pattern for HF downloads
- Use download_model from codai.models.cache module
- Model downloads to appropriate cache and exits without starting server
-
- 19 Mar, 2026 31 commits
-
-
Your Name authored
text.py's set_global_args() was only setting its local global_args but not calling state.set_global_args(). This meant _load_default_model() and _load_model_by_name() got None from get_global_args(), so CLI flags like --flash-attn, --n-gpu-layers, --ram were not passed to backends.
-
Your Name authored
- Add flash_attn extraction from global_args in _load_default_model()
- Add flash_attn extraction from global_args in _load_model_by_name()
- Now --flash-attn flag will properly enable Flash Attention 2 when loading models
-
Your Name authored
- ModelParserDispatcher: Only log parser selection when actually used for parsing
- ModelParserAdapter: Defer dispatcher creation until first use
- Fixes noisy 'model_name=None, selected parser: ApexBig50Parser' during initialization
-
Your Name authored
- loadall: pre-load image models into VRAM at startup (with OOM fallback)
- loadswap: pre-load image models into CPU RAM at startup (first model stays in VRAM)
- Audio and TTS models are cached at startup, loaded into memory on first request (they use specialized loading mechanisms via faster-whisper and kokoro)
-
Your Name authored
- Default mode changed to ondemand (pre-load first model, unload/load on switch)
- loadswap: load first model in VRAM, others in CPU RAM, swap on switch
- loadall: try to load all models in VRAM, offload to CPU RAM if OOM
- --nopreload: skip pre-loading in any mode, load on first request
- request_model() now properly handles all three modes
- Added _move_model_to_cpu() and _move_model_to_vram() for loadswap
- Fixed NameError: model_manager reference in request_model() (was using global singleton instead of self)
- Updated CLI help text for --loadall, --loadswap, --nopreload
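The startup pre-loading policy described above can be summarized as a small planning function. `preload_plan` is a hypothetical name; the per-mode behavior follows the commit message (note that loadall's OOM fallback happens at load time, so the plan simply targets VRAM for everything):

```python
def preload_plan(mode, models, nopreload=False):
    """Return (vram_models, cpu_models) to pre-load at startup.

    A sketch of the mode policy, not the project's actual code.
    """
    if nopreload or not models:
        return [], []                   # load everything on first request
    if mode == "ondemand":
        return [models[0]], []          # first model only; swap on switch
    if mode == "loadswap":
        return [models[0]], models[1:]  # rest parked in CPU RAM
    if mode == "loadall":
        return list(models), []         # all in VRAM; OOM -> CPU at load time
    raise ValueError(f"unknown mode: {mode}")
```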
-
Your Name authored
- Added request_model() method to MultiModelManager that handles:
  1. Alias resolution (image, audio, tts, vision, default, custom aliases)
  2. VRAM management (unloading previous models in ondemand mode)
  3. Checking if model is already loaded
- Simplified codai/api/images.py:
  - Uses request_model() for model resolution and VRAM management
  - Extracted helper functions: _is_gguf_model(), _load_diffusers_pipeline(), _generate_with_diffusers(), _generate_with_sdcpp(), _load_sdcpp_model()
  - Removed duplicated sd.cpp generation code
  - Fixed semaphore scope (all generation now inside semaphore block)
- Simplified codai/api/tts.py:
  - Uses request_model() instead of duplicated VRAM management code
  - Removed duplicate get_cached_model_path() and get_model_cache_dir() wrappers
- Simplified codai/api/transcriptions.py:
  - Uses request_model() instead of duplicated VRAM management code
- Simplified codai/api/text.py:
  - Both /v1/chat/completions and /v1/completions use request_model()
  - Removed duplicated VRAM management blocks
-
Your Name authored
-
Your Name authored
- **Model Manager**: Central coordinator for model lifecycle, alias resolution, loading/unloading
- **Cache Module**: Handles downloading, caching, and storage of models
- **API Modules**: Request models from Model Manager (not directly from cache)

Key changes:
- Removed resolve_and_load_model() from cache - moved logic to Model Manager
- Model Manager now downloads/caches models at startup when registered
- API modules use multi_model_manager.load_model() instead of cache functions
- Proper separation: Cache=storage, Manager=lifecycle coordination, APIs=requests

This fixes the incorrect direct API-to-cache coupling and establishes proper architectural boundaries.
-
Your Name authored
- Added resolve_and_load_model() function to codai.models.cache
- Simplified codai/api/images.py by removing 100+ lines of complex model resolution logic
- API modules now use single centralized function for all model loading
- Eliminates code duplication across API endpoints
- All model resolution logic now managed in one place
-
Your Name authored
- Added check in sd.cpp fallback to skip HF model IDs that are likely diffusers models
- Prevents sd.cpp from trying to download non-GGUF files like .gitattributes for diffusers models
- Tongyi-MAI/Z-Image-Turbo and similar diffusers models now handled correctly by diffusers library
- GGUF models still work with sd.cpp as before
-
Your Name authored
- Updated codai/api/images.py to use cache module functions directly
- Updated codai/api/tts.py to use centralized load_model() function
- Removed proxy method calls that were causing AttributeError
- All model loading/downloading now goes through codai.models.cache
-
Your Name authored
- Updated load_model() to handle three input types:
  1. Local files: Use directly without caching
  2. URLs: Download to cache if not cached, then use
  3. HF model IDs: Download via HF API if not cached, then use
- Updated get_cached_model_path() to validate local files
- Enhanced module documentation to reflect new capabilities
- All model types (text, image, audio, etc.) can now use any input type
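Classifying the three input types above is the first step of such a `load_model()`. This is a simplified sketch with an illustrative function name; the real code may use different heuristics for edge cases:

```python
import os
from urllib.parse import urlparse

def classify_model_source(spec):
    """Classify a model spec as 'local', 'url', or 'hf_id', mirroring the
    three input types load_model() handles."""
    if urlparse(spec).scheme in ("http", "https"):
        return "url"      # download to cache if not cached, then use
    if os.path.exists(spec):
        return "local"    # use directly, no caching
    return "hf_id"        # e.g. 'TheBloke/Llama-2-7B-GGUF'; fetch via HF API
```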
-
Your Name authored
- Updated remove_cached_model() to remove entire repo directories when matching by repo_id
- Previously only removed individual files, now removes complete repository cache
- Handles both files and directories in removal process
- More thorough cleanup of HuggingFace cached models
-
Your Name authored
- Added unified load_model() function as main entry point for model loading
- Updated WhisperServerManager to use centralized load_model() instead of inline logic
- Removed proxy methods from MultiModelManager - use cache module directly
- All cache functions now work seamlessly with both GGUF and HF model caches
- Improved separation of concerns: cache module handles all caching/downloading
-
Your Name authored
- Updated get_cached_model_path() to check both coderai and HF caches
- Updated download_model() to handle both URLs and HF model IDs automatically
- Made download_huggingface_model() consistent with unified API
- Updated module docstring to reflect unified cache functionality
- All cache functions now work seamlessly with both cache types
-
Your Name authored
- Updated remove_cached_model() to search by repo_id for HuggingFace models
- Moved cache management options (--list-cached-models, --remove-model, --remove-all-models) to run before heavy imports
- Improved cache operations to use centralized functions in codai.models.cache module
- Fixed model removal to work with full repo IDs like 'TheBloke/Llama-2-7B-GGUF'
-
Your Name authored
- Add list_cached_models_info() function to codai.models.cache module
- Move cache listing logic from main.py to the cache module
- Update main.py to use the centralized function early (before heavy imports)
- Improves code organization and avoids unnecessary imports for --list-cached-models
-
Your Name authored
- CoderAI cache: Shows individual GGUF files with sizes
- HuggingFace cache: Uses HF API (scan_cache_dir) to show model-level info, not individual files
- Shows model names, sizes, revision counts - not thousands of individual files
- Much more useful and readable output
-
Your Name authored
- Added code to print individual cached model files with sizes
- Previously only showed cache directory headers and summary
- Now shows each file with format: [cache_name] filename (size MB)
- Matches the format used by --remove-model command
-
Your Name authored
- Updated get_all_cache_dirs() to properly find HuggingFace hub directory
- Now checks for ~/.cache/huggingface/hub/ instead of just ~/.cache/huggingface/
- This fixes --list-cached-models not showing HuggingFace cached models
-
Your Name authored
- Removed the GGUF-only restriction on sd.cpp fallback
- Some HF models may be GGUF even without 'gguf' in the name
- Let sd.cpp attempt loading and fail gracefully if incompatible
- This allows sd.cpp to work as a proper fallback for any model type
-
Your Name authored
- Added check to only attempt sd.cpp fallback for GGUF models
- Tongyi-MAI/Z-Image-Turbo is a diffusers model, not GGUF, so sd.cpp should be skipped
- sd.cpp only supports GGUF models, diffusers models use the diffusers pipeline
- This prevents unnecessary sd.cpp resolution attempts for incompatible model types
-
Your Name authored
- Added proxy methods to MultiModelManager class for cache module functions
- These methods are called by images.py sd.cpp fallback path
- Fixes AttributeError: 'MultiModelManager' object has no attribute 'get_cached_model_path'
-
Your Name authored
- Enhanced the HF model resolution logic in images.py sd.cpp fallback path
- Now checks for ANY cached file from the repo first (not just GGUF files)
- Falls back to checking for cached GGUF files specifically
- Last resort: downloads the first file in the repo as fallback
- Better error handling and logging throughout the resolution process
- This should resolve models that are already cached even if the exact GGUF filename isn't known
-
Your Name authored
- Enhanced model resolution for sd.cpp fallback path
- Added multiple fallback strategies:
  1. Try HuggingFace GGUF resolution (existing)
  2. Fallback to direct file path check
  3. Fallback to cached model lookup
  4. Last resort: attempt download as URL
- Better error logging and handling
- Ensures model loading attempts all possible resolution paths before failing
-
Your Name authored
- Added model resolution and unload logic to /v1/audio/transcriptions
- Added model resolution and unload logic to /v1/audio/speech (TTS)
- Now ALL endpoints (text, image, audio, TTS) properly handle model switching
- In ondemand mode, ANY model type switch triggers unload first (e.g., text->audio, TTS->image, etc.)
-
Your Name authored
- Added resolve_model_name() to MultiModelManager to properly resolve model aliases
- Added get_currently_loaded_model_name() to track what's actually in VRAM
- Updated /v1/chat/completions, /v1/completions, and /v1/images/generations
- Now correctly compares resolved canonical names before deciding to unload
- Handles all aliases (default, image, audio, tts) and custom aliases
- Works across ALL model types: text->text2, image->image2, text->image, etc.
-
Your Name authored
- Added unload_all_models() to MultiModelManager that handles ALL model types: ModelManager, diffusers pipelines, sd.cpp StableDiffusion, and any other objects
- Text endpoints now properly unload image models before loading text models
- Image endpoints now properly unload text models before loading image models
- The rule: in ondemand mode, if the model in VRAM differs from the requested model (regardless of type), fully unload before loading the new one
- Includes gc.collect(), torch.cuda.empty_cache(), and 1s settle delay
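The cleanup sequence mentioned above (gc, CUDA cache release, settle delay) can be sketched as follows. The function name is illustrative; `torch` is imported lazily so the sketch also runs on machines without it:

```python
import gc
import time

def free_vram_after_unload(settle_seconds=1.0):
    """Post-unload cleanup: callers should drop all references to the
    model first, then this collects garbage, releases CUDA's cached
    allocations, and gives the driver a moment to settle."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # no torch installed; nothing CUDA-side to release
    time.sleep(settle_seconds)
    return True
```

The order matters: `empty_cache()` only returns memory whose tensors are already garbage-collected, so `gc.collect()` must run first.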
-
Your Name authored
- In ondemand mode (no --load-all or --loadswap specified), when a new model is requested, the current model in VRAM is now fully unloaded before loading the new one. This ensures clean model switching.
- Added cleanup logic to both /v1/chat/completions and /v1/completions endpoints
- Added same logic to image generation endpoints (diffusers and sd.cpp paths)
- Cleanup includes: model cleanup, gc.collect(), torch.cuda.empty_cache()
-
Your Name authored
Root cause: The refactored code was hardcoding torch.float16 for CUDA, ignoring the --image-precision bf16 CLI argument. The Z-Image-Turbo model requires bfloat16 precision - using float16 causes NaN values in the image processor, resulting in all-black images.

Also restored the original model loading logic with:
- GGUF model detection (skip diffusers for GGUF)
- OOM retry with progressive memory optimization
- use_safetensors=True
- Sequential CPU offload support
-
Your Name authored
- Changed default image size from 512x512 back to 1024x1024 to match original coderai
- Changed NaN handling from 0.5 to 0.0 to match original coderai
-