Backends, API, and tooling updates; gitignore township_output

- cuda/vulkan backend improvements and config plumbing
- API updates across characters, text, environments, audio, embeddings, tts
- admin chat/settings template updates
- add hf_loading helper, video request fields, platform paths
- new docs (CODERAI_API_DOCUMENTATION.md) and tools (review_outputs, video_dubber)
- ignore generated township_output/

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Commit f21c6185 authored Jun 10, 2026 by Stefy Lanza (nextime / spora )
23 changed files
--- a/.gitignore
+++ b/.gitignore
@@ -26,3 +26,6 @@ test_*.py

 # Local git worktrees
 .worktrees/
+
+# Generated township fighter outputs
+township_output/
--- a/AI.PROMPT
+++ b/AI.PROMPT
+# AI.PROMPT — coderai architecture notes for AI assistants
+
+This file records non-obvious architecture and invariants that AI assistants
+(and humans) must respect when working on coderai. Read it before changing model
+loading, configuration handling, or VRAM/eviction logic.
+
+================================================================================
+## Configuration is the single source of truth
+================================================================================
+
+Every model's behaviour comes from its entry in `~/.coderai/models.json`.
+Loaders MUST read settings from the per-model configuration, NOT from CLI args
+(`global_args`). CLI flags are not used to configure per-model behaviour.
+
+### Uniform models.json schema
+Every model entry — regardless of type — carries the same base fields:
+
+    load_in_4bit, load_in_8bit      # quantization (bitsandbytes)
+    flash_attention                 # NOTE: key is 'flash_attention', NOT 'flash_attn'
+    offload_strategy                # 'auto' | 'none' | 'cpu' | 'sequential' | 'model' | 'disk'
+    offload_dir                     # disk-offload directory
+    max_gpu_percent                 # cap GPU budget 0-100
+    manual_ram_gb                   # explicit CPU RAM budget
+    no_ram                          # GPU-only, no CPU spill
+    n_ctx                           # context size. NOTE: key is 'n_ctx', NOT 'context_size'
+    n_gpu_layers
+    used_vram_gb                    # raw (full-precision) VRAM estimate
+    precision                       # 'bf16' | 'f16' | 'f32'
+
+### Key-name traps (caused silent config-ignored bugs)
+- Use `flash_attention`  (legacy code wrongly read `flash_attn`)
+- Use `n_ctx`            (legacy code wrongly read `context_size`)
+Always read the canonical key first, with the legacy key only as a fallback.
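A minimal sketch of the canonical-first lookup described above. It is not taken from the commit: `read_setting` and `LEGACY_KEYS` are hypothetical names, and only the two key pairs named in the notes are covered.

```python
from typing import Any, Mapping

# Canonical key -> legacy spelling (the two traps named in the notes).
LEGACY_KEYS = {"flash_attention": "flash_attn", "n_ctx": "context_size"}


def read_setting(cfg: Mapping[str, Any], key: str, default: Any = None) -> Any:
    """Return cfg[key]; fall back to the legacy spelling, then to default."""
    if key in cfg:
        return cfg[key]
    legacy = LEGACY_KEYS.get(key)
    if legacy is not None and legacy in cfg:
        return cfg[legacy]
    return default


entry = {"flash_attn": True, "n_ctx": 8192, "context_size": 2048}
print(read_setting(entry, "flash_attention", False))  # legacy key used as fallback
print(read_setting(entry, "n_ctx", 4096))             # canonical key wins over legacy
```

Checking key presence rather than truthiness keeps an explicit `False` or `0` in the canonical key from being overridden by a stale legacy value.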
+
+================================================================================
+## Where config is translated and applied
+================================================================================
+
+### main.py :: build_kwargs_from_config(model_cfg, model_type)
+Emits a COMMON BASE of kwargs (quant/offload/flash/memory/n_gpu_layers) for
+EVERY type, plus type-specific extras, plus the original entry as `_raw_cfg`.
+No model type may be left without the common base.
+
+### codai/models/hf_loading.py  (shared helper — transformers / HF pipeline)
+Translates a model config into `from_pretrained` kwargs:
+- `build_from_pretrained_kwargs(cfg)` → torch_dtype, transformers
+  `BitsAndBytesConfig` (4-bit nf4 / 8-bit), `device_map='auto'` + `max_memory`
+  (GPU→CPU split), `offload_folder` (disk overflow), `attn_implementation`.
+- `build_quantization_config(cfg)` → transformers BitsAndBytesConfig or None.
+- `pipeline_device_kwargs(cfg)` → kwargs for HF `pipeline(...)` (uses model_kwargs).
+- `resolve_dtype(cfg, default)` → torch dtype from `precision`.
+Used by: spatial (depth/segmentation), embedding, audio_gen (AudioLDM2).
+
+### diffusers loaders (image, video, audioldm)
+Two shared helpers in `codai/models/hf_loading.py`, both keyed off the per-model
+`component_quantization` map (e.g. `{"transformer":"4bit","text_encoder":"8bit",
+"vae":"none"}`):
+- `build_pipeline_quant_config(model_name, cfg, dtype)` → diffusers
+  `PipelineQuantizationConfig` (`quant_mapping`) for in-place quantization.
+  Supported per-component modes:
+    * "4bit"/"8bit" → bitsandbytes (always available)
+    * "2bit"       → optimum-quanto int2 (needs `pip install optimum-quanto`;
+                     skipped with a warning if missing — bnb CANNOT do 2-bit)
+    * "none"/omitted → full precision
+  diffusers BnB for transformer/unet/vae; transformers BnB/Quanto for
+  text_encoder*. No override → global `load_in_4bit`/`8bit` on all heavy
+  components (MUST include UMT5 `text_encoder` for Wan2.2 or it won't fit GPU).
+- `build_gguf_pipeline_components(model_name, cfg, dtype)` → for any
+  `component_quantization` value that is a `*.gguf` path/URL, loads that
+  component via `<Class>.from_single_file(..., GGUFQuantizationConfig)` and
+  returns it to be injected as a pipeline kwarg (e.g. `transformer=<model>`).
+  This is the ONLY way to get 5-bit/6-bit (Q5_K/Q6_K) — they don't exist in
+  bnb/quanto; needs a pre-quantized GGUF file + the `gguf` lib. diffusers
+  components only.
+- Components discovered from `DiffusionPipeline.load_config(model_name)`.
+- Configurable in models.json (`component_quantization`) and the Admin UI
+  (Per-component quantization section: Default/2-bit/4-bit/8-bit/GGUF-file/None
+  per component).
+- bitsandbytes = 4/8-bit ONLY. 2-bit = quanto. 5/6-bit = GGUF files only.
+IMPORTANT: bitsandbytes-quantized pipelines/models are placed on GPU during
+`from_pretrained` and CANNOT be moved with `.to()` afterwards — load via
+`device_map` and skip the explicit `.to(device)` when a quant config is active.
+
+### text / vision
+Go through `codai/backends/cuda.py :: NvidiaBackend.load_model`, which already
+reads `flash_attn`, `load_in_4bit`, `load_in_8bit`, `offload_strategy`,
+`max_gpu_percent`, `no_ram`, `offload_dir` from kwargs.
+
+================================================================================
+## Loading strategy (GPU-max, then spill)
+================================================================================
+
+Always fill the GPU first, then spill to CPU RAM, then to disk. Never go
+straight to CPU unless explicitly configured. `no_ram=True` or
+`offload_strategy='none'` means GPU-only; pure-CPU is a last resort, used only
+when no CUDA device exists or every GPU+RAM+disk attempt failed.
+
+GPU budget is capped at `min(total × fraction, free_vram − 512MB headroom)` so
+we never request more VRAM than is physically free; overflow goes to CPU/disk
+via `device_map='auto'` + `max_memory` + `offload_folder`.
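The budget cap above can be written out as a small function. This is an illustrative sketch, not the code from the commit: `gpu_budget_bytes` is a hypothetical name, and it assumes `fraction` is `max_gpu_percent / 100`.

```python
HEADROOM_BYTES = 512 * 1024**2  # the 512 MB headroom named in the notes


def gpu_budget_bytes(total_vram: int, free_vram: int,
                     max_gpu_percent: float = 100.0) -> int:
    """min(total * fraction, free - headroom), never negative."""
    fraction = max(0.0, min(max_gpu_percent, 100.0)) / 100.0
    capped = int(total_vram * fraction)
    available = max(free_vram - HEADROOM_BYTES, 0)
    return min(capped, available)


GiB = 1024**3
# A 24 GiB card with 10 GiB free: the free-memory term is the binding limit.
print(gpu_budget_bytes(24 * GiB, 10 * GiB, max_gpu_percent=90) / GiB)
```

The result would typically be passed as the GPU entry of a `max_memory` mapping, with the CPU entry carrying the RAM budget.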
+
+bitsandbytes quantization requires CUDA; on CPU-only hosts it is skipped
+gracefully. MusicGen (audiocraft `get_pretrained`) exposes no quant hook and is
+the one loader left unquantized.
+
+================================================================================
+## VRAM estimate & eviction
+================================================================================
+
+### codai/models/manager.py :: _get_model_used_vram_gb
+Priority: measured delta (ground truth, factors already baked in) > explicit
+`used_vram_gb` (raw full-precision — quant/offload factors ARE applied on top) >
+local file size > HF cache size. Quantization factor: ÷4 (4-bit), ÷2 (8-bit).
+This is why a 4-bit Wan2.2 no longer reports a bogus ~151 GB.
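The priority order can be sketched as follows. This is a simplified illustration with a hypothetical function name. It assumes the quantization divisor is applied to every non-measured source (the notes state this only for `used_vram_gb`), and it leaves offload factors out.

```python
from typing import Optional


def estimate_vram_gb(measured_delta_gb: Optional[float],
                     explicit_raw_gb: Optional[float],
                     local_file_gb: Optional[float],
                     hf_cache_gb: Optional[float],
                     load_in_4bit: bool = False,
                     load_in_8bit: bool = False) -> Optional[float]:
    """Pick the best available estimate, highest-priority source first."""
    if measured_delta_gb is not None:
        return measured_delta_gb  # ground truth: factors already baked in
    raw = next((v for v in (explicit_raw_gb, local_file_gb, hf_cache_gb)
                if v is not None), None)
    if raw is None:
        return None
    divisor = 4 if load_in_4bit else 2 if load_in_8bit else 1
    return raw / divisor


# A raw 151 GB full-precision figure shrinks to 37.75 GB under 4-bit.
print(estimate_vram_gb(None, 151.0, None, None, load_in_4bit=True))
```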
+
+### Universal eviction invariant
+ANY model of ANY type, before loading when not already resident, must call
+`manager.ensure_vram_for(model_key, resolved_name)` — which evicts other models
+(one or more, LRU first) until there is enough free VRAM. This is wired into
+EVERY load branch of `request_model` (per-model "on-request", per-model "load",
+legacy "loadall", and the ondemand fallback). Do not add a new load path that
+returns `already_loaded: False` without calling `ensure_vram_for` first — that
+was the bug where a "load"-mode text model came back on CPU because the video
+model that displaced it was never evicted.
+
+### codai/models/manager.py :: _evict_models_for_vram(needed_gb)
+Evicts LRU-first ONLY until `free_vram >= needed_gb`, so multiple small models
+COEXIST in VRAM when they fit together. Eviction is synchronous (cleanup +
+`cuda.empty_cache()` complete before returning) so load/unload happens one at a
+time. Accelerate device_map models: call
+`accelerate.hooks.remove_hook_from_submodules(model)` BEFORE moving to CPU,
+otherwise the dispatch hooks keep CUDA tensors alive and VRAM is never freed.
+
+NEVER evict a BUSY model (one with ref_count > 0 in its ModelInstancePool —
+i.e. actively serving a request). Moving its weights off the GPU mid-forward-
+pass crashes the in-flight request with a CUDA device-side assert and poisons
+the context. `_evict_models_for_vram` checks `_is_key_busy(key)` and
+`_wait_until_idle(key)` before evicting; busy non-active models are skipped, the
+active model is waited on. `_evict_key` calls `pool.cleanup_all()` so EVERY
+instance (not just the primary) is freed. Because eviction can BLOCK waiting for
+a model to go idle, callers MUST invoke `request_model` off the event loop
+(`await asyncio.to_thread(...)`) — text and video endpoints already do — or the
+wait deadlocks the very request it is waiting on.
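A sketch of the minimal, LRU-first victim selection described above. It is a simplification with hypothetical names: busy models are only skipped here, while the notes say the active model is waited on. Real eviction also has to release the weights and empty the CUDA cache.

```python
from collections import OrderedDict
from dataclasses import dataclass
from typing import List


@dataclass
class Resident:
    vram_gb: float
    ref_count: int = 0  # > 0 means the model is serving a request


def plan_eviction(resident: "OrderedDict[str, Resident]",
                  free_gb: float, needed_gb: float) -> List[str]:
    """Choose victims oldest-first and stop as soon as enough VRAM is free."""
    victims: List[str] = []
    for key, model in resident.items():  # iteration order: LRU -> MRU
        if free_gb >= needed_gb:
            break
        if model.ref_count > 0:
            continue  # never evict a busy model
        victims.append(key)
        free_gb += model.vram_gb
    return victims


loaded = OrderedDict(
    a=Resident(6.0), b=Resident(8.0, ref_count=1), c=Resident(4.0), d=Resident(10.0)
)
print(plan_eviction(loaded, free_gb=2.0, needed_gb=10.0))
```

Stopping at the first point where `free_gb >= needed_gb` is what lets small models keep coexisting: `d` stays resident in the example because evicting `a` and `c` already frees enough.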
+
+================================================================================
+## Flash-Attention-2 requires the whole model on GPU
+================================================================================
+
+FA2 kernels are CUDA-only and assume every layer is resident on one CUDA device.
+If the model is split GPU+CPU (accelerate offload), FA2 triggers a CUDA
+device-side assert that corrupts the process. Therefore `backends/cuda.py` only
+enables FA2 when the model fits fully in free GPU VRAM, or the user forced
+full-GPU residence (`no_ram=True` or `offload_strategy='none'`). Otherwise it
+falls back to `attn_implementation='sdpa'` (which handles mixed devices and still
+uses flash kernels for GPU-resident layers). The manager passes
+`expected_vram_gb` (= `_get_model_used_vram_gb`) into `load_model` so the backend
+can make this decision. The three UI flash checkboxes (`flash_attention`,
+`sdcpp_flash_attn`, `sdcpp_diffusion_flash_attn`) are OR'd into FA2 intent for
+transformers models, since the sdcpp flags are no-ops for HF models.
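The FA2 decision can be condensed into one function. This is a sketch under assumptions: the function name is hypothetical, and returning `"sdpa"` when no flash flag is set is a choice made for the example, since the notes do not say what the backend does in that case.

```python
from typing import Any, Mapping, Optional

FLASH_FLAGS = ("flash_attention", "sdcpp_flash_attn", "sdcpp_diffusion_flash_attn")


def choose_attn_implementation(cfg: Mapping[str, Any],
                               expected_vram_gb: Optional[float],
                               free_vram_gb: float) -> str:
    """Enable FA2 only when the whole model is guaranteed to sit on the GPU."""
    wants_flash = any(cfg.get(flag) for flag in FLASH_FLAGS)  # flags are OR'd
    if not wants_flash:
        return "sdpa"
    forced_gpu = bool(cfg.get("no_ram")) or cfg.get("offload_strategy") == "none"
    fits = expected_vram_gb is not None and expected_vram_gb <= free_vram_gb
    return "flash_attention_2" if (forced_gpu or fits) else "sdpa"


print(choose_attn_implementation({"flash_attention": True}, 20.0, 16.0))
```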
+
+================================================================================
+## CUDA device-side assert poisons the whole process — fail fast
+================================================================================
+
+A CUDA "device-side assert triggered" / "illegal memory access" / "CUDA error"
+corrupts the CUDA context PROCESS-WIDE. Every subsequent GPU op then fails with
+the same async assert echo. It is UNRECOVERABLE in-process — the server must be
+restarted. `MultiModelManager._mark_cuda_poisoned_if_fatal(err)` sets
+`cuda_context_poisoned`; `request_model`, the text retry loop, and the video
+endpoint all check it and fail fast with HTTP 503 + "Restart coderai to recover"
+instead of retrying dozens of times. Do NOT add retry loops that re-load onto a
+poisoned context.
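A minimal sketch of the fail-fast pattern. The class and exception names are hypothetical, and matching on the three message fragments quoted above is an assumption about how fatal errors are recognised.

```python
FATAL_MARKERS = ("device-side assert triggered", "illegal memory access", "cuda error")


class CudaPoisonedError(RuntimeError):
    """Raised instead of retrying; an HTTP layer would map this to 503."""
    status_code = 503


class PoisonFlag:
    def __init__(self) -> None:
        self.poisoned = False

    def mark_if_fatal(self, err: BaseException) -> bool:
        """Latch the flag when the error text looks like a fatal CUDA fault."""
        if any(marker in str(err).lower() for marker in FATAL_MARKERS):
            self.poisoned = True
        return self.poisoned

    def check(self) -> None:
        """Call at the top of every load/generate path."""
        if self.poisoned:
            raise CudaPoisonedError("CUDA context poisoned. Restart coderai to recover")


flag = PoisonFlag()
flag.mark_if_fatal(RuntimeError("CUDA error: device-side assert triggered"))
print(flag.poisoned)
```

The flag is a latch: once set it is never cleared in-process, which matches the statement that only a restart recovers the context.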
+
+================================================================================
+## VRAM is not freed after eviction without expandable_segments
+================================================================================
+
+`codai/__init__.py` sets `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
+BEFORE torch is imported (honouring any pre-set value). This is REQUIRED, not
+optional. Symptom without it: after evicting a model, params move to CPU
+(`memory_allocated` drops) but `torch.cuda.memory_reserved()` stays high and
+`torch.cuda.empty_cache()` frees almost nothing — VRAM stays ~full and the next
+model can't load. Root cause: HF/accelerate keeps the tied embedding/lm_head
+weight (a single ~2 GB live tensor) in `tied_params_map`; the default CUDA
+allocator cannot return a segment that contains ANY live block, so that one
+tensor pins the whole ~18 GB segment. expandable_segments lets the allocator
+return the freed pages around it. Do not remove this env var; do not set CUDA
+config after torch initializes (it is read once at first CUDA use).
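One way to set the variable while honouring a pre-set value is `dict.setdefault` on the environment, run before anything imports torch. The function name is hypothetical; the actual `codai/__init__.py` may differ.

```python
import os
from typing import MutableMapping


def configure_cuda_allocator(env: MutableMapping[str, str] = os.environ) -> str:
    """Set the allocator config unless the user already chose one."""
    return env.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


configure_cuda_allocator()
# import torch  # must only happen after the call above
```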
+
+`NvidiaBackend.cleanup` also: removes accelerate hooks, walks every submodule
+moving raw `_parameters`/`_buffers` to CPU (model.to('cpu') is a silent no-op on
+dispatched models), and breaks lingering list/dict references to this model's
+GPU tensors (scoped by storage data_ptr so coexisting models are untouched).
+
+================================================================================
+## Chat generation: turn-boundary stop + enable_thinking (model-agnostic)
+================================================================================
+
+This is NOT a Qwen-only server. Keep model-specific handling minimal and
+detection-based. Two helpers in `backends/cuda.py` handle chat generation
+generically:
+
+- `_eos_token_ids()` — returns eos_token_id PLUS any known turn-end token that
+  ACTUALLY EXISTS in this model's vocab (<|im_end|>, <|eot_id|>, <|end|>,
+  <end_of_turn>, …). Each is added only if `convert_tokens_to_ids` returns a
+  real (non-unk) id, so a Llama/Mistral/etc. model just doesn't get <|im_end|>.
+  Without this, models whose turn ends with <|im_end|> (not the eos
+  <|endoftext|>) never stop and hallucinate extra assistant/user turns.
+- `_build_chat_prompt(messages, enable_thinking, add_generation_prompt)` — uses
+  the MODEL'S OWN `tokenizer.apply_chat_template` when it has one (correct
+  special tokens + proper enable_thinking handling), falling back to the legacy
+  formatter otherwise. `enable_thinking` is passed to the template only if it
+  accepts the kwarg (TypeError → retry without), so non-thinking models are
+  unaffected. enable_thinking threads from the request:
+  text.py(reasoning_enabled) → ModelManager.generate_chat[_stream] →
+  NvidiaBackend. Default False (suppress reasoning); True keeps <think> blocks
+  for callers that ask. Do NOT hardcode model-family tokens in generation paths
+  — gate on vocab presence / template support.
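Both helpers can be sketched against any object exposing the usual tokenizer attributes. These are illustrative stand-ins, not the bodies of `_eos_token_ids` / `_build_chat_prompt`. The candidate list contains only the four tokens spelled out above, and the fallback to a legacy formatter is omitted.

```python
from typing import Any, Dict, List

TURN_END_CANDIDATES = ("<|im_end|>", "<|eot_id|>", "<|end|>", "<end_of_turn>")


def eos_token_ids(tokenizer: Any) -> List[int]:
    """eos_token_id plus every turn-end token that exists in this vocab."""
    ids: List[int] = []
    if tokenizer.eos_token_id is not None:
        ids.append(tokenizer.eos_token_id)
    unk = getattr(tokenizer, "unk_token_id", None)
    for token in TURN_END_CANDIDATES:
        token_id = tokenizer.convert_tokens_to_ids(token)
        # Unknown tokens map to the unk id (or None): skip those.
        if token_id is not None and token_id != unk and token_id not in ids:
            ids.append(token_id)
    return ids


def build_chat_prompt(tokenizer: Any, messages: List[Dict[str, str]],
                      enable_thinking: bool = False) -> str:
    """Use the model's own template; drop enable_thinking if it is rejected."""
    try:
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True,
            enable_thinking=enable_thinking)
    except TypeError:
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True)
```

The resulting id list is what would be handed to generation as the stop set, so a model that lacks `<|im_end|>` never receives it.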
+
+================================================================================
+## Thermal protection (model-agnostic, config-driven)
+================================================================================
+
+A long sequence of heavy generations can drive CPU/GPU hot enough that the
+machine's own protection powers it off. `codai/models/thermal.py` guards against
+this: before serving a request against a loaded model it waits until temps are
+safe.
+
+- Single choke point: `MultiModelManager.request_model()` calls
+  `thermal.wait_until_safe()` right after the CUDA-poison check, so EVERY request
+  type (text/image/video/audio/tts/embedding/spatial) is covered once.
+- Mid-generation checkpoints (`thermal.checkpoint(context, throttle_seconds)`)
+  pause long runs that overheat AFTER the pre-request check passed: diffusion
+  step callbacks (`_vid_step_cb` / image `_step_cb`) call it per denoise step;
+  HF text generation adds a `StoppingCriteria` (via `_make_thermal_criteria()`
+  in backends/cuda.py) that runs ON the generate thread — blocking the streamer
+  CONSUMER loop would NOT pause the GPU (generation runs in a separate thread).
+  GGUF/llama.cpp text (backends/vulkan.py) uses the same idea via a llama.cpp
+  `StoppingCriteriaList` (`_make_llama_thermal_criteria()`) passed to every
+  create_(chat_)completion call — llama.cpp evaluates it synchronously per token.
+  Throttled (≈2 s) for high-frequency token loops; unthrottled for per-step.
+- Config lives in config.json `thermal` (ThermalConfig): cpu_enabled, gpu_enabled
+  (default True each), cpu_high/cpu_resume, gpu_high/gpu_resume (default 90/87),
+  poll_seconds (default 5). Editable live in the admin Settings page — saving
+  pushes values onto global_args so no restart is needed.
+- Hysteresis: pause when temp >= *high*, resume only once temp <= *resume*
+  (resume < high). A sensor that can't be read is treated as safe (never blocks).
+- Readers: GPU via nvidia-smi (then rocm-smi, then psutil amdgpu); CPU via psutil
+  (k10temp/coretemp), then /sys/class/thermal, then `sensors`. 2s reading cache.
+- ASYNC REQUIREMENT: the wait is a blocking time.sleep, so request_model MUST
+  always be invoked via `asyncio.to_thread(...)` from async endpoints — never
+  call it directly in an async handler, or the cooldown stalls the event loop
+  and the whole server stops accepting requests. All api/*.py call sites already
+  do this; keep it that way for any new endpoint.
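The hysteresis rule can be sketched as a blocking wait. This is a simplified illustration, not `thermal.wait_until_safe()` itself: the sensor reader and `sleep` are injected so the logic can be exercised without hardware, and the defaults mirror the 90/87/5 values listed above.

```python
import time
from typing import Callable, Optional


def wait_until_safe(read_temp: Callable[[], Optional[float]],
                    high: float = 90.0, resume: float = 87.0,
                    poll_seconds: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Block while hot; return how many polls were slept through."""
    temp = read_temp()
    if temp is None or temp < high:
        return 0  # unreadable sensors are treated as safe
    polls = 0
    while temp is not None and temp > resume:
        sleep(poll_seconds)  # blocking: run via asyncio.to_thread in async code
        polls += 1
        temp = read_temp()
    return polls


readings = iter([91.0, 89.0, 88.0, 86.0])
print(wait_until_safe(lambda: next(readings), sleep=lambda s: None))
```

Note that 89 and 88 are below `high` but still keep the wait going: once paused, the loop only exits at or below `resume`, which is what prevents rapid pause/resume flapping.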
+
+================================================================================
+## Invariants checklist when adding/altering a model loader
+================================================================================
+
+1. Read settings from the per-model config, never from CLI/global_args.
+2. Use the canonical keys (`flash_attention`, `n_ctx`).
+3. transformers/HF-pipeline model → use codai/models/hf_loading helpers.
+   diffusers model → use PipelineQuantizationConfig.
+4. Honour quantization, offload_strategy, offload_dir, max_gpu_percent, no_ram,
+   manual_ram_gb, precision, flash_attention.
+5. Do not `.to()` a quantized model/pipeline; place it via device_map.
+6. Maximize GPU first, then CPU RAM, then disk; CPU-only is last resort.
+7. Never break VRAM coexistence — evict only the minimum needed.
+8. FA2 only when the model fits fully on GPU; else use SDPA (offload-safe).
+9. On a CUDA device-side assert, fail fast (503) — never retry onto a poisoned
+   context; the server must be restarted.
+10. Keep `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` (set in
+    codai/__init__.py before torch import) — without it, evicted-model VRAM is
+    never returned to the driver.
+11. Eviction must free CPU RAM too, not just VRAM: cleanup calls
+    `_trim_cpu_ram()` (glibc malloc_trim) so the evicted model's host-side copy /
+    offloaded weights are returned to the OS and swap is reclaimed — otherwise
+    RSS creeps up across evict/load cycles.
+12. CPU threads are capped to HALF the cores (when >= 8) in codai/__init__.py
+    (OMP/MKL/OpenBLAS env before torch import) so model loading / 4-bit dequant
+    never saturates the machine. Do NOT lower sys.setswitchinterval during long
+    loads — it caused GIL scheduler thrashing (load avg > 10).
+13. request_model() must ALWAYS be called via asyncio.to_thread from async
+    endpoints — it can block (thermal cooldown, waiting for a busy model). A
+    direct call stalls the event loop.
+14. Thermal protection is config-driven and model-agnostic (config.json
+    `thermal`). Don't special-case it per model/backend; it only reads temps and
+    sleeps. Honour the enable flags and high/resume hysteresis.
--- a/CODERAI_API_DOCUMENTATION.md
+++ b/CODERAI_API_DOCUMENTATION.md
+# CoderAI API Documentation
+
+This document describes the full HTTP API exposed by CoderAI, including OpenAI-compatible endpoints, native multimodal endpoints, profile/LoRA APIs, pipelines, admin APIs, examples, and end-to-end workflows.
+
+The API is implemented with FastAPI in `codai/api/app.py` and routers under `codai/api/`, with admin routes under `codai/admin/routes.py`.
+
+## Base URL
+
+Default local server:
+
+```text
+http://127.0.0.1:8776
+```
+
+Most client calls use the `/v1` prefix:
+
+```text
+http://127.0.0.1:8776/v1
+```
+
+## Authentication
+
+CoderAI supports web sessions and API bearer tokens.
+
+For `/v1/*` routes, send:
+
+```http
+Authorization: Bearer <api-token>
+```
+
+Token management is available in the admin UI and admin API:
+
+- `GET /admin/tokens`
+- `GET /admin/api/tokens`
+- `POST /admin/api/tokens`
+
+Notes:
+
+- `/v1/images/progress` is explicitly exempt from bearer auth in middleware.
+- If the admin/session manager is not initialized, API auth can be bypassed by the server.
+- Admin HTML/API routes use signed session cookies; many admin API routes require an admin role.
+- Some profile routes also enforce local API auth internally.
+
+Example reusable shell variables:
+
+```bash
+export CODERAI_URL="http://127.0.0.1:8776"
+export CODERAI_TOKEN="your-api-token"
+```
+
+Example JSON request:
+
+```bash
+curl -s "$CODERAI_URL/v1/models" \
+  -H "Authorization: Bearer $CODERAI_TOKEN"
+```
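For clients written in Python, the same bearer-token request can be built with the standard library. This is a hedged sketch: `build_request` is a hypothetical helper, and it only constructs the request so it can be inspected without a running server.

```python
import json
import urllib.request
from typing import Optional


def build_request(base_url: str, path: str, token: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """GET when there is no body, JSON POST otherwise."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=data,
        method="POST" if body is not None else "GET",
    )
    req.add_header("Authorization", f"Bearer {token}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


req = build_request("http://127.0.0.1:8776", "/v1/models", "your-api-token")
print(req.get_method(), req.full_url)
# To send it:
#   with urllib.request.urlopen(req) as resp:
#       models = json.load(resp)
```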
+
+## Common Data Conventions
+
+### Media Inputs
+
+Media fields usually accept either:
+
+- A URL: `http://...`, `https://...`, or a CoderAI file URL such as `/v1/files/output.png`
+- Raw base64 without a data URL prefix
+- Data URLs such as `data:image/png;base64,...`, `data:video/mp4;base64,...`, `data:audio/wav;base64,...`
+
+### Media Outputs
+
+Generation endpoints typically return:
+
+```json
+{
+  "created": 1781090000,
+  "data": [
+    {
+      "url": "/v1/files/generated.png"
+    }
+  ]
+}
+```
+
+If `response_format` requests base64, the first data item uses a media-specific key:
+
+- Images: `b64_json`
+- Video: `b64_mp4`
+- Audio: `b64_wav` or `b64_mp3`
+
+### Progress Polling
+
+Long-running image, video, audio, and LoRA jobs expose polling endpoints. Typical progress response:
+
+```json
+{
+  "current": 12,
+  "total": 30,
+  "active": true,
+  "phase": "generating",
+  "model": "model-id",
+  "pct": 40.0,
+  "it_per_s": 1.3,
+  "elapsed": 8.9
+}
+```
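A polling loop over such responses might look like the sketch below. It rests on assumptions: `active: false` is taken to mean the job has finished, `it_per_s` is read as iterations per second, and `fetch` stands in for whatever performs the HTTP GET.

```python
import time
from typing import Callable, Optional


def eta_seconds(progress: dict) -> Optional[float]:
    """Remaining steps divided by the reported rate, when both are usable."""
    rate = progress.get("it_per_s") or 0
    if not progress.get("active") or rate <= 0:
        return None
    return (progress["total"] - progress["current"]) / rate


def wait_for_job(fetch: Callable[[], dict], poll_seconds: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the job reports active == False; return the last response."""
    while True:
        progress = fetch()
        if not progress.get("active"):
            return progress
        sleep(poll_seconds)


sample = {"current": 12, "total": 30, "active": True, "it_per_s": 1.3}
print(eta_seconds(sample))
```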
+
+### Extra Fields
+
+Most request models allow extra JSON fields (`extra="allow"`). This makes the API tolerant of OpenAI-compatible or Studio-style client parameters even when a specific route ignores them.
+
+## Core Endpoints
+
+### List Models
+
+`GET /v1/models`
+
+Returns configured models and metadata.
+
+Response shape:
+
+```json
+{
+  "object": "list",
+  "data": [
+    {
+      "id": "Qwen/Qwen3-8B",
+      "object": "model",
+      "created": 1781090000,
+      "owned_by": "huggingface",
+      "type": "text",
+      "capabilities": ["text_generation"],
+      "backend": "cuda",
+      "model_path": "Qwen/Qwen3-8B",
+      "alias": "qwen3"
+    }
+  ]
+}
+```
+
+Example:
+
+```bash
+curl -s "$CODERAI_URL/v1/models" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" | jq
+```
+
+### Capabilities Document
+
+`GET /coderai/capabilities`
+
+Returns CoderAI broker/studio capability metadata and hardware summary. This endpoint is used by AISBF and discovery integrations.
+
+Example:
+
+```bash
+curl -s "$CODERAI_URL/coderai/capabilities" | jq
+```
+
+### Serve Generated Files
+
+`GET /v1/files/{filename}`
+
+Returns a generated or uploaded file from the configured output directory. Path traversal is rejected.
+
+Example:
+
+```bash
+curl -L "$CODERAI_URL/v1/files/generated.png" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -o generated.png
+```
+
+### File Archive
+
+`GET /v1/archive`
+
+Lists generated media in the output/archive directory.
+
+```json
+{
+  "files": [
+    {
+      "filename": "image_001.png",
+      "type": "image",
+      "size": 123456,
+      "created": 1781090000,
+      "url": "/v1/files/image_001.png"
+    }
+  ]
+}
+```
+
+`DELETE /v1/archive/{filename}` deletes an archived file.
+
+```bash
+curl -X DELETE "$CODERAI_URL/v1/archive/image_001.png" \
+  -H "Authorization: Bearer $CODERAI_TOKEN"
+```
+
+## Text Generation
+
+CoderAI exposes OpenAI-compatible chat and legacy completion APIs.
+
+### Chat Completions
+
+`POST /v1/chat/completions`
+
+Request fields:
+
+| Field | Type | Default | Description |
+|---|---:|---:|---|
+| `model` | string | required | Model id from `/v1/models` |
+| `messages` | array | required | Chat messages with `role` and `content` |
+| `temperature` | number | `0.7` | Sampling temperature |
+| `top_p` | number | `1.0` | Nucleus sampling |
+| `n` | integer | `1` | Number of completions |
+| `max_tokens` | integer/null | `null` | Max generated tokens |
+| `stream` | boolean | `false` | Return SSE chunks |
+| `stop` | string/array/null | `null` | Stop sequence(s) |
+| `presence_penalty` | number | `0.0` | OpenAI-compatible field |
+| `frequency_penalty` | number | `0.0` | OpenAI-compatible field |
+| `repeat_penalty` | number | `1.0` | Repetition penalty |
+| `tools` | array/null | `null` | Function/tool definitions |
+| `tool_choice` | string/object/null | `auto` | Tool selection control |
+| `enable_thinking` | boolean | `false` | Enables reasoning/thinking templates where supported |
+| `response_format` | object/null | `null` | Accepted for compatibility |
+
+Basic request:
+
+```bash
+curl -s "$CODERAI_URL/v1/chat/completions" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-8B",
+    "messages": [
+      {"role": "system", "content": "You are concise."},
+      {"role": "user", "content": "Explain VRAM offloading in one paragraph."}
+    ],
+    "temperature": 0.4,
+    "max_tokens": 300
+  }' | jq
+```
+
+Response shape:
+
+```json
+{
+  "id": "chatcmpl-...",
+  "object": "chat.completion",
+  "created": 1781090000,
+  "model": "Qwen/Qwen3-8B",
+  "choices": [
+    {
+      "index": 0,
+      "message": {
+        "role": "assistant",
+        "content": "VRAM offloading..."
+      },
+      "finish_reason": "stop"
+    }
+  ],
+  "usage": {
+    "prompt_tokens": 42,
+    "completion_tokens": 80,
+    "total_tokens": 122
+  }
+}
+```
+
+Streaming request:
+
+```bash
+curl -N "$CODERAI_URL/v1/chat/completions" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-8B",
+    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
+    "stream": true
+  }'
+```
+
+Streaming responses use server-sent event style lines:
+
+```text
+data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[...]}
+
+data: [DONE]
+```
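Parsing these lines on the client side can be sketched as follows. The chunk layout `choices[].delta.content` is the usual OpenAI streaming convention and is assumed here, since the example above elides the contents of `choices`.

```python
import json
from typing import Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON chunks until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the text deltas of a streamed chat completion."""
    parts = []
    for chunk in iter_sse_chunks(lines):
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)


stream = [
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print(collect_text(stream))  # Hello
```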
+
+Tool calling example:
+
+```bash
+curl -s "$CODERAI_URL/v1/chat/completions" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-8B",
+    "messages": [{"role": "user", "content": "What is the weather in Rome?"}],
+    "tools": [
+      {
+        "type": "function",
+        "function": {
+          "name": "get_weather",
+          "description": "Get current weather for a city",
+          "parameters": {
+            "type": "object",
+            "properties": {"city": {"type": "string"}},
+            "required": ["city"]
+          }
+        }
+      }
+    ],
+    "tool_choice": "auto"
+  }'
+```
+
+### Legacy Completions
+
+`POST /v1/completions`
+
+Request fields are similar to OpenAI legacy completions:
+
+| Field | Type | Default |
+|---|---:|---:|
+| `model` | string | required |
+| `prompt` | string or string[] | required |
+| `temperature` | number | `0.7` |
+| `top_p` | number | `1.0` |
+| `n` | integer | `1` |
+| `max_tokens` | integer/null | `null` |
+| `stream` | boolean | `false` |
+| `stop` | string/array/null | `null` |
+| `repeat_penalty` | number | `1.0` |
+
+Example:
+
+```bash
+curl -s "$CODERAI_URL/v1/completions" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-8B",
+    "prompt": "The fastest way to reduce inference memory is",
+    "max_tokens": 120
+  }' | jq
+```
+
+## Images
+
+### Image Progress
+
+`GET /v1/images/progress`
+
+Returns the current image-generation progress. This route is exempt from bearer auth in middleware.
+
+```bash
+curl -s "$CODERAI_URL/v1/images/progress" | jq
+```
+
+### Generate Images
+
+`POST /v1/images/generations`
+
+Request fields:
+
+| Field | Type | Default | Description |
+|---|---:|---:|---|
+| `model` | string | required | Image model id |
+| `prompt` | string | required | Positive prompt |
+| `n` | integer | `1` | Number of images |
+| `size` | string | `1024x1024` | Output size |
+| `steps` | integer/null | model default | Inference steps |
+| `guidance_scale` | number/null | model default | CFG/guidance |
+| `quality` | string | `standard` | Compatibility field |
+| `style` | string/null | `null` | Compatibility/style field |
+| `response_format` | string | `url` | `url` or `b64_json` |
+| `seed` | integer/null | random | Deterministic seed |
+| `negative_prompt` | string/null | `null` | Negative prompt |
+| `disable_safety_checker` | boolean | `false` | Disable safety checker where supported |
+| `vae_model` | string/null | `null` | Per-request VAE override |
+| `loras` | array/null | `null` | LoRA adapters `{model, weight, name}` |
+| `character_profiles` | string[]/null | `null` | Saved character profile names |
+| `character_references` | string[]/null | `null` | Inline reference images |
+| `character_strength` | number | `0.6` | IP-Adapter/reference strength |
+| `environment_profiles` | string[]/null | `null` | Saved environment profile names |
+
+Example:
+
+```bash
+curl -s "$CODERAI_URL/v1/images/generations" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "stabilityai/stable-diffusion-xl-base-1.0",
+    "prompt": "cinematic photo of a brass robot botanist in a glass greenhouse, morning mist",
+    "negative_prompt": "blurry, low quality, distorted hands",
+    "size": "1024x1024",
+    "steps": 30,
+    "guidance_scale": 7.0,
+    "seed": 12345,
+    "response_format": "url"
+  }' | jq
+```
+
+LoRA example:
+
+```json
+{
+  "model": "image-model",
+  "prompt": "portrait of <character-token> as a space pilot",
+  "loras": [
+    {"model": "/home/me/loras/space_uniform.safetensors", "weight": 0.8, "name": "uniform"}
+  ]
+}
+```
+
+Character/environment consistency example:
+
+```json
+{
+  "model": "image-model",
+  "prompt": "Alice explores the old library at sunset",
+  "character_profiles": ["Alice"],
+  "environment_profiles": ["OldLibrary"],
+  "character_strength": 0.75,
+  "size": "1024x1024"
+}
+```
+
+### Edit Image
+
+`POST /v1/images/edits`
+
+Fields:
+
+- `model` required
+- `prompt` required
+- `image` required, base64/URL source image
+- `mask` optional
+- `n`, `size`, `response_format`, `strength`, `steps`, `guidance_scale`, `seed`, `quality`
+
+```bash
+curl -s "$CODERAI_URL/v1/images/edits" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "image-edit-model",
+    "image": "data:image/png;base64,...",
+    "prompt": "turn the sky into dramatic storm clouds",
+    "strength": 0.55,
+    "response_format": "url"
+  }'
+```
+
+### Inpaint Image
+
+`POST /v1/images/inpaint`
+
+Like edits, but `mask` is required.
+
+```json
+{
+  "model": "inpaint-model",
+  "image": "data:image/png;base64,...",
+  "mask": "data:image/png;base64,...",
+  "prompt": "replace the masked area with a carved wooden door",
+  "strength": 0.99,
+  "steps": 30,
+  "response_format": "url"
+}
+```
+
+### Upscale Image
+
+`POST /v1/images/upscale`
+
+```json
+{
+  "model": "realesrgan-x4plus",
+  "image": "data:image/png;base64,...",
+  "scale": 4,
+  "response_format": "url"
+}
+```
+
+### Depth Map
+
+`POST /v1/images/depth`
+
+```json
+{
+  "model": "depth-anything",
+  "image": "data:image/png;base64,...",
+  "response_format": "url"
+}
+```
+
+### Segment Image
+
+`POST /v1/images/segment`
+
+```json
+{
+  "model": "sam-vit-h",
+  "image": "data:image/png;base64,...",
+  "points": [[420, 300]],
+  "boxes": [[100, 100, 600, 700]],
+  "response_format": "url"
+}
+```
+
+### Deblur Image
+
+`POST /v1/images/deblur`
+
+```json
+{
+  "image": "data:image/png;base64,...",
+  "strength": 0.5,
+  "response_format": "url"
+}
+```
+
+### Unpixelate Image
+
+`POST /v1/images/unpixelate`
+
+```json
+{
+  "model": "realesrgan-x4plus",
+  "image": "data:image/png;base64,...",
+  "scale": 4,
+  "response_format": "url"
+}
+```
+
+### Outfit Change
+
+`POST /v1/images/outfit`
+
+Fields:
+
+- `model` required
+- `image` or `video` optional input
+- `prompt` required outfit/clothing description
+- `negative_prompt`, `mask`, `steps`, `guidance_scale`, `strength`, `seed`, `response_format`
+
+```json
+{
+  "model": "inpaint-model",
+  "image": "data:image/png;base64,...",
+  "prompt": "tailored navy velvet evening suit with silver embroidery",
+  "negative_prompt": "distorted body, extra limbs",
+  "steps": 30,
+  "guidance_scale": 7.5,
+  "strength": 0.92,
+  "response_format": "url"
+}
+```
+
+### Face Swap
+
+`POST /v1/images/faceswap`
+
+```json
+{
+  "source_face": "data:image/png;base64,...",
+  "target": "data:image/png;base64,...",
+  "target_type": "image",
+  "response_format": "url"
+}
+```
+
+For video targets, use `target_type: "video"`.
+
+## Video
+
+### Video Progress
+
+`GET /v1/video/progress`
+
+```bash
+curl -s "$CODERAI_URL/v1/video/progress" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" | jq
+```
+
+### Generate Video
+
+`POST /v1/video/generations`
+
+Primary fields:
+
+| Field | Type | Default | Description |
+|---|---:|---:|---|
+| `model` | string | required | Video model id |
+| `prompt` | string | `""` | Text prompt |
+| `negative_prompt` | string/null | `null` | Negative prompt |
+| `width` | integer | `512` | Width |
+| `height` | integer | `512` | Height |
+| `num_frames` | integer/null | model default | Frame count |
+| `fps` | integer/null | model default | Frames per second |
+| `num_inference_steps` | integer/null | model default | Diffusion steps |
+| `guidance_scale` | number/null | model default | CFG/guidance |
+| `seed` | integer/null | random | Seed |
+| `mode` | string | `t2v` | `t2v`, `i2v`, `v2v`, `ti2v`, `interp` |
+| `image` / `init_image` | string/null | `null` | Initial/reference frame |
+| `end_image` | string/null | `null` | End frame for interpolation |
+| `video` | string/null | `null` | Input video for v2v/post-processing |
+| `strength` | number/null | `null` | Denoising strength |
+| `camera_motion` | string/null | `null` | `zoom-in`, `pan-left`, etc. |
+| `character_profiles` | string[]/null | `null` | Saved character profiles |
+| `loras` | array/null | `null` | Video LoRA adapters |
+| `response_format` | string | `url` | `url` or `b64_mp4` |
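
The table above can be turned into a small client-side request builder. This is only a sketch under stated assumptions: `build_video_request` and `clip_seconds` are hypothetical names, and dropping unset fields assumes the server then applies the defaults listed in the table.

```python
import json

# Mode names copied from the `mode` row of the table above.
VIDEO_MODES = {"t2v", "i2v", "v2v", "ti2v", "interp"}


def build_video_request(model: str, mode: str = "t2v", **fields) -> dict:
    """Assemble a /v1/video/generations body (hypothetical helper)."""
    if mode not in VIDEO_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    body = {"model": model, "mode": mode}
    # Leave out unset fields so the server can fall back to its defaults.
    body.update({key: value for key, value in fields.items() if value is not None})
    return body


def clip_seconds(num_frames: int, fps: int) -> float:
    """Clip length in seconds: frame count divided by frame rate."""
    return num_frames / fps


payload = build_video_request(
    "video-model",
    prompt="a slow dolly shot through a neon market in the rain",
    num_frames=49,
    fps=12,
    seed=None,  # omitted from the body because it is None
)
print(json.dumps(payload, indent=2))
print(f"{clip_seconds(49, 12):.2f} s")
```

Rejecting an unknown `mode` on the client avoids a round trip for an obvious typo; which extra fields each mode needs is left to the server.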
+
+Text-to-video example:
+
+```bash
+curl -s "$CODERAI_URL/v1/video/generations" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "video-model",
+    "mode": "t2v",
+    "prompt": "a slow dolly shot through a neon market in the rain",
+    "negative_prompt": "low quality, flicker",
+    "width": 768,
+    "height": 432,
+    "num_frames": 49,
+    "fps": 12,
+    "num_inference_steps": 30,
+    "guidance_scale": 6.0,
+    "seed": 9001,
+    "response_format": "url"
+  }' | jq
+```
+
+Image-to-video example:
+
+```json
+{
+  "model": "i2v-model",
+  "mode": "i2v",
+  "prompt": "gentle camera push-in, hair and fabric moving in the wind",
+  "init_image": "data:image/png;base64,...",
+  "num_frames": 32,
+  "fps": 8,
+  "camera_motion": "zoom-in",
+  "response_format": "url"
+}
+```
+
+Video with generated audio, subtitles, dub, and post-processing:
+
+```json
+{
+  "model": "video-model",
+  "prompt": "a robot chef prepares pasta in a futuristic kitchen",
+  "mode": "t2v",
+  "num_frames": 49,
+  "fps": 12,
+  "add_audio": true,
+  "audio_type": "ambient",
+  "audio_prompt": "soft kitchen ambience, gentle synth pad",
+  "generate_subtitles": true,
+  "burn_subtitles": true,
+  "subtitle_style": "minimal",
+  "upscale_output": true,
+  "upscale_factor": 2,
+  "interpolate_output": true,
+  "fps_multiplier": 2,
+  "response_format": "url"
+}
+```
+
+Multi-character dialog example:
+
+```json
+{
+  "model": "video-model",
+  "prompt": "two detectives talk in a dim archive room",
+  "character_profiles": ["DetectiveA", "DetectiveB"],
+  "dialogs": [
+    {"character": "DetectiveA", "voice": "narrator_a", "text": "The file was never missing.", "lip_sync": true},
+    {"character": "DetectiveB", "voice": "narrator_b", "text": "Then someone wanted us to think it was.", "lip_sync": true}
+  ],
+  "burn_subtitles": true,
+  "response_format": "url"
+}
+```
+
+### Upscale Video
+
+`POST /v1/video/upscale`
+
+```json
+{
+  "model": "realesrgan-video",
+  "video": "data:video/mp4;base64,...",
+  "upscale_factor": 2,
+  "response_format": "url"
+}
+```
+
+### Subtitle Video
+
+`POST /v1/video/subtitle`
+
+```json
+{
+  "model": "whisper-large-v3",
+  "video": "data:video/mp4;base64,...",
+  "language": "en",
+  "translate": true,
+  "target_lang": "it",
+  "burn": false,
+  "style": "default",
+  "response_format": "srt"
+}
+```
+
+`response_format` can be `srt`, `vtt`, `json`, or `burned_video`.
+
+### Interpolate Video or Frames
+
+`POST /v1/video/interpolate`
+
+```json
+{
+  "model": "rife",
+  "video": "data:video/mp4;base64,...",
+  "fps_multiplier": 2,
+  "response_format": "url"
+}
+```
+
+Frame interpolation:
+
+```json
+{
+  "model": "rife",
+  "init_image": "data:image/png;base64,...",
+  "end_image": "data:image/png;base64,...",
+  "fps_multiplier": 4,
+  "response_format": "url"
+}
+```
+
+### Dub Video
+
+`POST /v1/video/dub`
+
+```json
+{
+  "model": "whisper-large-v3",
+  "video": "data:video/mp4;base64,...",
+  "source_lang": "en",
+  "target_lang": "es",
+  "voice_clone": true,
+  "burn_subtitles": true,
+  "response_format": "url"
+}
+```
+
+## Audio
+
+### Transcriptions
+
+`POST /v1/audio/transcriptions`
+
+This is an OpenAI-style multipart form endpoint.
+
+Form fields:
+
+| Field | Type | Default | Description |
+|---|---:|---:|---|
+| `model` | string | required | Whisper/transcription model |
+| `file` | file | required | Audio/video file upload |
+| `language` | string/null | `null` | Language hint |
+| `prompt` | string/null | `null` | Context prompt |
+| `response_format` | string | `json` | `json`, `verbose_json`, `text`, `srt`, `vtt` |
+| `temperature` | number | `0.0` | Decoding temperature |
+
+Example:
+
+```bash
+curl -s "$CODERAI_URL/v1/audio/transcriptions" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -F model="whisper-large-v3" \
+  -F file=@speech.wav \
+  -F language="en" \
+  -F response_format="json" | jq
+```
+
+Text-only response:
+
+```bash
+curl -s "$CODERAI_URL/v1/audio/transcriptions" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -F model="whisper-large-v3" \
+  -F file=@speech.wav \
+  -F response_format="text"
+```
+
+### Text-to-Speech
+
+`POST /v1/audio/speech`
+
+Request fields:
+
+- `model` required
+- `input` required text
+- `voice` default `af_sarah`
+- `response_format` default `mp3`
+- `speed` default `1.0`
+- `voice_profile` optional saved profile name
+
+Response:
+
+```json
+{
+  "audio": "<base64-audio>"
+}
+```
+
+Example:
+
+```bash
+curl -s "$CODERAI_URL/v1/audio/speech" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "kokoro",
+    "input": "Local inference is online.",
+    "voice": "af_sarah",
+    "response_format": "mp3",
+    "speed": 1.0
+  }' | jq -r .audio | base64 -d > speech.mp3
+```
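
The same decoding step can be done in Python. The sketch below assumes only the response shape shown above (a JSON object with a base64 `audio` field); `decode_speech_response` is a hypothetical helper, and the response body here is a stand-in rather than real server output.

```python
import base64
import json


def decode_speech_response(body: str) -> bytes:
    """Return raw audio bytes from a JSON body shaped like {"audio": "<base64>"}."""
    payload = json.loads(body)
    return base64.b64decode(payload["audio"])


# Stand-in response body; a real one would come from POST /v1/audio/speech.
fake_body = json.dumps({"audio": base64.b64encode(b"ID3 demo").decode("ascii")})
audio_bytes = decode_speech_response(fake_body)
```

Writing `audio_bytes` to a file opened in binary mode gives the same result as the `base64 -d` pipeline in the curl example.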
+
+### Audio Generation Progress
+
+`GET /v1/audio/progress`
+
+```bash
+curl -s "$CODERAI_URL/v1/audio/progress" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" | jq
+```
+
+### Generate Audio / Music / SFX
+
+`POST /v1/audio/generate`
+
+Request fields:
+
+| Field | Type | Default |
+|---|---:|---:|
+| `model` | string | required |
+| `prompt` | string | required |
+| `duration` | number | `10.0` |
+| `top_k` | integer | `250` |
+| `top_p` | number | `0.0` |
+| `temperature` | number | `1.0` |
+| `cfg_coef` | number | `3.0` |
+| `seed` | integer/null | `null` |
+| `melody` | string/null | `null` |
+| `voice_profile` | string/null | `null` |
+| `response_format` | string | `url` |
+
+Example:
+
+```json
+{
+  "model": "facebook/musicgen-medium",
+  "prompt": "warm lo-fi loop with brushed drums and soft Rhodes chords",
+  "duration": 12,
+  "temperature": 1.0,
+  "cfg_coef": 3.0,
+  "seed": 44,
+  "response_format": "url"
+}
+```
+
+Melody-conditioned example:
+
+```json
+{
+  "model": "facebook/musicgen-melody",
+  "prompt": "cinematic orchestral arrangement of the melody",
+  "melody": "data:audio/wav;base64,...",
+  "duration": 20,
+  "response_format": "url"
+}
+```
+
+### Voice Profiles
+
+List voices:
+
+`GET /v1/audio/voices`
+
+Create voice profile:
+
+`POST /v1/audio/voices`
+
+Multipart fields:
+
+- `name`
+- `transcript`
+- `description`
+- `audio` file
+
+```bash
+curl -s "$CODERAI_URL/v1/audio/voices" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -F name="narrator_a" \
+  -F transcript="This is the exact reference transcript." \
+  -F description="Warm narrator voice" \
+  -F audio=@reference.wav | jq
+```
+
+Get, patch, delete:
+
+- `GET /v1/audio/voices/{name}`
+- `PATCH /v1/audio/voices/{name}`
+- `DELETE /v1/audio/voices/{name}`
+
+Extract a voice profile from audio or video:
+
+`POST /v1/audio/voices/extract`
+
+```json
+{
+  "name": "speaker_from_clip",
+  "description": "Extracted from interview clip",
+  "video": "data:video/mp4;base64,...",
+  "transcript": "Optional exact transcript for the selected speech segment."
+}
+```
+
+### Voice Clone
+
+`POST /v1/audio/clone`
+
+Fields:
+
+- `text` required output text
+- `voice_name` optional saved profile
+- `ref_audio` and `ref_text` optional inline reference
+- `speed`, `seed`, `response_format`
+
+Using saved voice:
+
+```json
+{
+  "text": "The archive doors opened at midnight.",
+  "voice_name": "narrator_a",
+  "speed": 0.95,
+  "seed": 10,
+  "response_format": "url"
+}
+```
+
+Using inline reference:
+
+```json
+{
+  "text": "The system is ready.",
+  "ref_audio": "data:audio/wav;base64,...",
+  "ref_text": "This is the reference speaker transcript.",
+  "response_format": "b64_wav"
+}
+```
+
+### Voice Conversion
+
+`POST /v1/audio/convert`
+
+Fields:
+
+- `source_audio` required
+- `target_voice` or `voice_name` optional
+- `f0_condition` singing-mode pitch conditioning
+- `pitch_shift`
+- `diffusion_steps`
+- `length_adjust`
+- `inference_cfg_rate`
+- `response_format`
+
+```json
+{
+  "source_audio": "data:audio/wav;base64,...",
+  "voice_name": "singer_a",
+  "f0_condition": true,
+  "pitch_shift": 0,
+  "diffusion_steps": 20,
+  "response_format": "url"
+}
+```
+
+### Audio Stems
+
+`POST /v1/audio/stems`
+
+```json
+{
+  "audio": "data:audio/wav;base64,...",
+  "stem_mode": "vocals-instrumental",
+  "response_format": "url",
+  "fallback_mode": true
+}
+```
+
+Supported `stem_mode` values include:
+
+- `vocals-instrumental`
+- `4-stem`
+- `drums-bass-other`
+
+### Audio Cleanup
+
+`POST /v1/audio/cleanup`
+
+```json
+{
+  "audio": "data:audio/wav;base64,...",
+  "noise_reduction": true,
+  "normalize": true,
+  "remove_hum": true,
+  "repair_clicks": false,
+  "response_format": "url",
+  "fallback_mode": true
+}
+```
+
+## Embeddings
+
+`POST /v1/embeddings`
+
+Request fields:
+
+| Field | Type | Default | Description |
+|---|---:|---:|---|
+| `model` | string | required | Embedding model |
+| `input` | string/string[] | required | Text input(s) |
+| `image` | string/string[]/null | `null` | Optional image input(s) for multimodal embeddings |
+| `encoding_format` | string | `float` | `float` or `base64` |
+| `dimensions` | integer/null | `null` | Optional truncation size |
+
+Example:
+
+```bash
+curl -s "$CODERAI_URL/v1/embeddings" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "BAAI/bge-small-en-v1.5",
+    "input": ["first document", "second document"],
+    "encoding_format": "float"
+  }' | jq
+```
+
+Response shape:
+
+```json
+{
+  "object": "list",
+  "data": [
+    {"object": "embedding", "index": 0, "embedding": [0.01, -0.02]},
+    {"object": "embedding", "index": 1, "embedding": [0.03, 0.04]}
+  ],
+  "model": "BAAI/bge-small-en-v1.5",
+  "usage": {"prompt_tokens": 4, "total_tokens": 4}
+}
+```
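
A common use of this response is comparing vectors. The following sketch computes cosine similarity between the two embeddings in the response shape above; it uses only the standard library, and `cosine_similarity` is an illustrative helper rather than anything CoderAI provides.

```python
import math

# Response copied from the shape documented above.
response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.01, -0.02]},
        {"object": "embedding", "index": 1, "embedding": [0.03, 0.04]},
    ],
    "model": "BAAI/bge-small-en-v1.5",
    "usage": {"prompt_tokens": 4, "total_tokens": 4},
}


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Sort by `index` so vectors line up with the order of the inputs.
ordered = sorted(response["data"], key=lambda item: item["index"])
vectors = [item["embedding"] for item in ordered]
score = cosine_similarity(vectors[0], vectors[1])
```

Sorting on `index` is a defensive choice; it costs little and does not rely on the server returning items in input order.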
+
+Multimodal embedding example:
+
+```json
+{
+  "model": "clip-embedding-model",
+  "input": "a red sports car",
+  "image": "data:image/png;base64,...",
+  "encoding_format": "base64"
+}
+```
+
+## Character Profiles
+
+Character profiles are named collections of reference images used for visual identity consistency in image/video generation.
+
+### Create or Replace Character
+
+`POST /v1/characters`
+
+```json
+{
+  "name": "Alice",
+  "description": "Short-haired detective in a charcoal coat",
+  "images": [
+    {"label": "front", "data": "data:image/png;base64,..."},
+    {"label": "side", "data": "data:image/png;base64,..."}
+  ]
+}
+```
+
+Response:
+
+```json
+{"ok": true, "name": "Alice", "image_count": 2}
+```
+
+### List Characters
+
+`GET /v1/characters`
+
+```json
+{
+  "characters": [
+    {"name": "Alice", "description": "...", "image_count": 2, "created_at": 1781090000}
+  ]
+}
+```
+
+### Get Character
+
+`GET /v1/characters/{name}`
+
+Returns profile metadata plus base64 images.
+
+### Patch Character
+
+`PATCH /v1/characters/{name}`
+
+```json
+{
+  "description": "Updated description",
+  "add_images": [{"label": "close-up", "data": "data:image/png;base64,..."}],
+  "remove_indices": [0]
+}
+```
+
+### Delete Character
+
+`DELETE /v1/characters/{name}`
+
+### Generate Character References
+
+`POST /v1/characters/generate`
+
+Generates reference images from text and saves them as a profile.
+
+```json
+{
+  "name": "CaptainNova",
+  "description": "A calm starship captain",
+  "prompt": "consistent character sheet, woman starship captain, front and side views, clean studio lighting",
+  "model": "image-model",
+  "n": 4,
+  "steps": 30,
+  "width": 768,
+  "height": 768
+}
+```
+
+### Extract Character from Media
+
+`POST /v1/characters/extract`
+
+```json
+{
+  "name": "InterviewGuest",
+  "description": "Face crops extracted from source video",
+  "videos": ["data:video/mp4;base64,..."],
+  "max_images": 5
+}
+```
+
+## Environment Profiles
+
+Environment profiles are named collections of reference images used to condition scene/background style.
+
+Routes mirror character profiles:
+
+- `POST /v1/environments`
+- `GET /v1/environments`
+- `GET /v1/environments/{name}`
+- `PATCH /v1/environments/{name}`
+- `DELETE /v1/environments/{name}`
+- `POST /v1/environments/generate`
+- `POST /v1/environments/extract`
+
+Create example:
+
+```json
+{
+  "name": "OldLibrary",
+  "description": "Warm wood, tall shelves, dust in sunset beams",
+  "images": [
+    {"label": "wide", "data": "data:image/png;base64,..."}
+  ]
+}
+```
+
+Generate example:
+
+```json
+{
+  "name": "MarsHangar",
+  "description": "Industrial red planet aircraft hangar",
+  "prompt": "wide cinematic environment concept art of a Mars aircraft hangar, dust, red light, realistic",
+  "model": "image-model",
+  "n": 4,
+  "width": 1024,
+  "height": 768
+}
+```
+
+Use in generation:
+
+```json
+{
+  "model": "image-model",
+  "prompt": "Alice stands beside a parked rover",
+  "character_profiles": ["Alice"],
+  "environment_profiles": ["MarsHangar"]
+}
+```
+
+## LoRA Training and Registry
+
+### Train LoRA
+
+`POST /v1/loras/train`
+
+Request fields:
+
+| Field | Type | Default | Description |
+|---|---:|---:|---|
+| `name` | string | required | LoRA name |
+| `base_model` | string | required | Base model to train against |
+| `train_base_model` | string/null | `null` | Optional training model override |
+| `target` | string | `image` | `image` or `video` |
+| `quantize_4bit` | boolean | `true` | Quantized training where supported |
+| `num_frames` | integer | `1` | Video/frame setting |
+| `character` | string/null | `null` | Use saved character profile |
+| `environment` | string/null | `null` | Use saved environment profile |
+| `images` | string[]/null | `null` | Inline training images |
+| `instance_prompt` | string/null | `null` | Instance prompt/token |
+| `steps` | integer | `800` | Training steps |
+| `rank` | integer | `16` | LoRA rank |
+| `learning_rate` | number | `0.0001` | LR |
+| `resolution` | integer | `512` | Training resolution |
+| `seed` | integer | `42` | Seed |
+
+Example:
+
+```json
+{
+  "name": "alice_identity",
+  "base_model": "image-model",
+  "target": "image",
+  "character": "Alice",
+  "instance_prompt": "photo of alice_person",
+  "steps": 800,
+  "rank": 16,
+  "learning_rate": 0.0001,
+  "resolution": 768
+}
+```
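
For clients that build training requests programmatically, the field table above maps naturally onto a dataclass. This is a sketch, not CoderAI code: `LoraTrainRequest` and `to_body` are hypothetical names, the defaults are copied from the table, and omitting `null` fields assumes the server treats a missing field the same as `null`.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class LoraTrainRequest:
    """Body for POST /v1/loras/train; defaults mirror the field table."""

    name: str
    base_model: str
    train_base_model: Optional[str] = None
    target: str = "image"
    quantize_4bit: bool = True
    num_frames: int = 1
    character: Optional[str] = None
    environment: Optional[str] = None
    images: Optional[List[str]] = None
    instance_prompt: Optional[str] = None
    steps: int = 800
    rank: int = 16
    learning_rate: float = 0.0001
    resolution: int = 512
    seed: int = 42

    def to_body(self) -> dict:
        """Return a JSON-ready dict without the unset (None) fields."""
        if self.target not in ("image", "video"):
            raise ValueError(f"target must be 'image' or 'video', got {self.target!r}")
        return {key: value for key, value in asdict(self).items() if value is not None}


request = LoraTrainRequest(
    name="alice_identity",
    base_model="image-model",
    character="Alice",
    instance_prompt="photo of alice_person",
    resolution=768,
)
body = request.to_body()
```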
+
+Training is blocking, and jobs are queued and run one at a time.
+
+### LoRA Progress
+
+`GET /v1/loras/progress`
+
+```bash
+curl -s "$CODERAI_URL/v1/loras/progress" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" | jq
+```
+
+### LoRA Registry
+
+- `GET /v1/loras`
+- `GET /v1/loras/{name}`
+- `DELETE /v1/loras/{name}`
+
+Use a trained LoRA in image/video requests:
+
+```json
+{
+  "model": "image-model",
+  "prompt": "alice_person in a cyberpunk alley",
+  "loras": [{"model": "alice_identity", "weight": 0.85}]
+}
+```
+
+## 2D / 3D / Spatial APIs
+
+### Image to 3D
+
+`POST /v1/images/to3d`
+
+```json
+{
+  "image": "data:image/png;base64,...",
+  "method": "mesh",
+  "max_shift": 20,
+  "response_format": "url"
+}
+```
+
+Supported `method` values include `stereo`, `anaglyph`, `depth`, and `mesh`.
+
+### 3D to Image
+
+`POST /v1/images/from3d`
+
+```json
+{
+  "model_data": "data:model/gltf-binary;base64,...",
+  "format": "glb",
+  "camera_distance": 2.0,
+  "camera_elevation": 30,
+  "camera_azimuth": 45,
+  "width": 768,
+  "height": 768,
+  "response_format": "url"
+}
+```
+
+### Video to 3D
+
+`POST /v1/video/to3d`
+
+```json
+{
+  "video": "data:video/mp4;base64,...",
+  "method": "anaglyph",
+  "max_shift": 15,
+  "response_format": "url"
+}
+```
+
+### 3D to Video
+
+`POST /v1/video/from3d`
+
+```json
+{
+  "model_data": "data:model/gltf-binary;base64,...",
+  "format": "glb",
+  "frames": 36,
+  "fps": 12,
+  "camera_elevation": 20,
+  "camera_distance": 2.5,
+  "width": 768,
+  "height": 768,
+  "response_format": "url"
+}
+```
+
+### Generate 3D Model
+
+`POST /v1/3d/generate`
+
+```json
+{
+  "prompt": "a stylized low-poly red dragon statue",
+  "model": "3d-model",
+  "steps": 64,
+  "seed": 42,
+  "response_format": "url"
+}
+```
+
+Image-conditioned 3D generation:
+
+```json
+{
+  "image": "data:image/png;base64,...",
+  "model": "triposr",
+  "response_format": "url"
+}
+```
+
+## Built-In Pipelines
+
+Pipelines chain existing endpoints server-side and aggregate `steps` and `data`.
+
+Implementation caveat: `codai/api/pipelines.py` currently imports video helpers named `create_video_generation` and `create_video_dub`, while `codai/api/video.py` defines the route handlers as `video_generations` and `video_dub`. If those aliases are not added elsewhere at runtime, built-in video pipeline calls can fail even though the routes are registered. The lower-level video endpoints documented above are the canonical API surface.
+
+### Image to Video Pipeline
+
+`POST /v1/pipelines/image-to-video`
+
+Steps:
+
+1. Generate image with `image_model`
+2. Animate it with `video_model`
+3. Optionally add audio and upscale
+
+```json
+{
+  "prompt": "a lonely lighthouse under aurora lights, cinematic",
+  "image_model": "image-model",
+  "video_model": "video-model",
+  "image_size": "1024x1024",
+  "image_steps": 30,
+  "image_cfg": 7.0,
+  "image_seed": 100,
+  "num_frames": 32,
+  "fps": 8,
+  "num_inference_steps": 25,
+  "guidance_scale": 6.5,
+  "camera_motion": "zoom-in",
+  "add_audio": true,
+  "audio_type": "ambient",
+  "audio_prompt": "distant waves, soft wind",
+  "upscale_output": true,
+  "response_format": "url"
+}
+```
+
+### Video Dub Pipeline
+
+`POST /v1/pipelines/video-dub`
+
+```json
+{
+  "model": "whisper-large-v3",
+  "video": "data:video/mp4;base64,...",
+  "source_lang": "en",
+  "target_lang": "de",
+  "voice_clone": true,
+  "burn_subtitles": true,
+  "response_format": "url"
+}
+```
+
+### Story Pipeline
+
+`POST /v1/pipelines/story`
+
+Steps:
+
+1. LLM writes visual scene descriptions
+2. Image model generates scene images
+3. Video model animates the first scene
+4. Optional TTS narration
+
+```json
+{
+  "story": "A courier robot crosses a flooded city to deliver a seed vault key.",
+  "text_model": "Qwen/Qwen3-8B",
+  "image_model": "image-model",
+  "video_model": "video-model",
+  "tts_model": "kokoro",
+  "tts_voice": "af_sarah",
+  "num_scenes": 4,
+  "num_frames": 32,
+  "fps": 8,
+  "response_format": "url"
+}
+```
+
+### Audio Dub Pipeline
+
+`POST /v1/pipelines/audio-dub`
+
+Steps:
+
+1. Transcribe source audio/video
+2. Optionally translate transcript
+3. Synthesize dubbed audio with voice cloning
+4. If input is video, replace audio track
+
+```json
+{
+  "video": "data:video/mp4;base64,...",
+  "voice_name": "narrator_a",
+  "source_lang": "en",
+  "target_lang": "fr",
+  "whisper_model": "whisper-large-v3",
+  "speed": 1.0,
+  "burn_subtitles": true,
+  "response_format": "url"
+}
+```
+
+## Custom Pipelines
+
+Custom pipelines let clients define reusable multi-step workflows with template variables.
+
+Implementation caveat: custom pipeline execution calls each handler with `(request, http_request)`. Some handlers in `codai/api/` accept only the request object, so step types whose handlers do not accept an HTTP request may need handler signature adjustments before they run reliably. Treat `/v1/pipelines/step-types` as the server's advertised builder schema and validate complex custom pipelines in your deployment.
+
+### List Custom Pipelines
+
+`GET /v1/pipelines/custom`
+
+### List Step Types
+
+`GET /v1/pipelines/step-types`
+
+Supported step types include:
+
+- `text_gen`
+- `image_gen`
+- `image_edit`
+- `image_inpaint`
+- `image_upscale`
+- `image_deblur`
+- `image_unpix`
+- `image_outfit`
+- `image_faceswap`
+- `video_gen`
+- `video_upscale`
+- `video_sub`
+- `video_interp`
+- `video_dub`
+- `tts`
+- `stt`
+- `audio_gen`
+- `voice_clone`
+- `voice_convert`
+
+Template variables:
+
+- `{{input}}` - pipeline runtime input
+- `{{stepN.output}}` - extracted text/base output from step N
+- `{{stepN.url}}` - first URL output from step N
+- `{{stepN.<field>}}` - any extracted field from step N
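
To make the substitution rules concrete, here is a minimal client-side sketch of how such placeholders could be resolved. It is not the server's implementation: `render` is a hypothetical function, and it assumes steps are numbered from zero and that each step's extracted fields are available as a dictionary.

```python
import re

# Matches {{input}} and {{stepN.field}} exactly as written in the list above.
PLACEHOLDER = re.compile(r"\{\{(input|step(\d+)\.(\w+))\}\}")


def render(template: str, pipeline_input: str, step_results: list) -> str:
    """Substitute placeholders; step_results[N] holds the fields of step N."""

    def replace(match):
        if match.group(1) == "input":
            return pipeline_input
        index, field = int(match.group(2)), match.group(3)
        return str(step_results[index][field])

    return PLACEHOLDER.sub(replace, template)


# Illustrative data: one finished step with an `output` and a `url` field.
results = [{"output": "a lighthouse at dusk", "url": "https://example.invalid/poster.png"}]
prompt = render("{{step0.output}}, slow cinematic movement", "lighthouse", results)
```

In this sketch a reference to a step that has not produced results raises `IndexError`, and a missing field raises `KeyError`; how the server reports such cases is not specified here.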
+
+### Create Custom Pipeline
+
+`POST /v1/pipelines/custom`
+
+```json
+{
+  "id": "poster-to-trailer",
+  "name": "Poster to Trailer",
+  "description": "Generate a poster concept, animate it, then create music.",
+  "steps": [
+    {
+      "type": "text_gen",
+      "label": "Write visual prompt",
+      "params": {
+        "model": "Qwen/Qwen3-8B",
+        "system": "Write vivid visual prompts only.",
+        "prompt": "Turn this idea into a cinematic image prompt: {{input}}"
+      }
+    },
+    {
+      "type": "image_gen",
+      "label": "Generate poster",
+      "params": {
+        "model": "image-model",
+        "prompt": "{{step0.output}}",
+        "size": "1024x1024"
+      }
+    },
+    {
+      "type": "video_gen",
+      "label": "Animate poster",
+      "params": {
+        "model": "video-model",
+        "mode": "i2v",
+        "prompt": "{{step0.output}}, slow cinematic movement",
+        "init_image": "{{step1.url}}",
+        "num_frames": 32,
+        "fps": 8
+      }
+    },
+    {
+      "type": "audio_gen",
+      "label": "Create soundtrack",
+      "params": {
+        "model": "musicgen",
+        "prompt": "epic short trailer music for: {{input}}",
+        "duration": 12
+      },
+      "continue_on_error": true
+    }
+  ]
+}
+```
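
Before submitting a definition like the one above, a client can run a few local sanity checks. The sketch below is a hypothetical pre-flight validator, not part of CoderAI: it assumes steps are zero-indexed and run in order (as the `{{step0.output}}` and `{{step1.url}}` references in the example suggest), and it takes the step type names from the list in this document rather than from `/v1/pipelines/step-types`.

```python
import json
import re

# Step type names copied from the "Supported step types" list in this document.
STEP_TYPES = {
    "text_gen", "image_gen", "image_edit", "image_inpaint", "image_upscale",
    "image_deblur", "image_unpix", "image_outfit", "image_faceswap",
    "video_gen", "video_upscale", "video_sub", "video_interp", "video_dub",
    "tts", "stt", "audio_gen", "voice_clone", "voice_convert",
}
STEP_REF = re.compile(r"\{\{step(\d+)\.\w+\}\}")


def check_pipeline(definition: dict) -> list:
    """Return a list of human-readable problems found in a definition."""
    problems = []
    for position, step in enumerate(definition.get("steps", [])):
        if step.get("type") not in STEP_TYPES:
            problems.append(f"step {position}: unknown type {step.get('type')!r}")
        # Serialise params so that nested strings are scanned as well.
        for ref in STEP_REF.findall(json.dumps(step.get("params", {}))):
            if int(ref) >= position:
                problems.append(f"step {position}: references step {ref}, which has not run yet")
    return problems


good = {"steps": [
    {"type": "text_gen", "params": {"prompt": "{{input}}"}},
    {"type": "image_gen", "params": {"prompt": "{{step0.output}}"}},
]}
bad = {"steps": [{"type": "magic", "params": {"prompt": "{{step0.output}}"}}]}
good_problems = check_pipeline(good)
bad_problems = check_pipeline(bad)
```

Here `good_problems` is empty, while `bad_problems` reports both the unknown step type and the self-reference.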
+
+### Update and Delete
+
+- `PUT /v1/pipelines/custom/{pipeline_id}`
+- `DELETE /v1/pipelines/custom/{pipeline_id}`
+
+### Run Saved Pipeline
+
+`POST /v1/pipelines/custom/{pipeline_id}/run`
+
+```json
+{
+  "input": "a solar-powered train crossing the Sahara at night"
+}
+```
+
+### Run Inline Pipeline
+
+`POST /v1/pipelines/run`
+
+Sends a `PipelineDefinition` directly without saving it. The current implementation executes with an empty `{{input}}`, so either include static params or save the pipeline and run it through `POST /v1/pipelines/custom/{pipeline_id}/run` when runtime input is required.
+
+### Audio Understanding Pipeline
+
+`POST /v1/pipelines/audio-understand`
+
+Transcribes audio, then optionally asks a text model to summarize or reason over it.
+
+```json
+{
+  "audio": "data:audio/wav;base64,...",
+  "audio_model": "whisper-large-v3",
+  "text_model": "Qwen/Qwen3-8B",
+  "input": "Summarize action items and decisions.",
+  "language": "en"
+}
+```
+
+### Audio Music Dub Pipeline
+
+`POST /v1/pipelines/audio-music-dub`
+
+The current implementation returns a structured workflow with placeholder stages for stems, translation/adaptation, voice conversion, and remix.
+
+```json
+{
+  "audio": "data:audio/wav;base64,...",
+  "audio_model": "whisper-large-v3",
+  "target_lang": "it",
+  "source_lang": "en",
+  "notes": "Preserve rhyme and chorus structure."
+}
+```
+
+## Admin HTML Routes
+
+Admin pages authenticate with a session cookie.
+
+| Method | Path | Purpose | Auth |
+|---|---|---|---|
+| `GET` | `/login` | Login page | Public |
+| `POST` | `/login` | Login form | Public |
+| `GET` | `/logout` | Logout | Optional session |
+| `GET` | `/admin/change-password` | Password change page | Logged-in |
+| `POST` | `/admin/change-password` | Change password | Logged-in |
+| `GET` | `/admin` | Dashboard | Logged-in |
+| `GET` | `/admin/models` | Model management page | Admin |
+| `GET` | `/admin/tokens` | Token page | Admin |
+| `GET` | `/admin/users` | User page | Admin |
+| `GET` | `/chat` | Chat UI | Logged-in |
+| `GET` | `/admin/settings` | Settings page | Admin |
+| `GET` | `/admin/archive` | Archive page | Admin |
+
+Static assets are mounted under `/static/admin/*`.
+
+## Admin API
+
+Unless noted otherwise, admin APIs require a valid session cookie and the admin role.
+
+### Status, Users, Tokens
+
+| Method | Path | Body/Query | Purpose |
+|---|---|---|---|
+| `GET` | `/admin/api/status` | none | System, model, VRAM, queue, recent activity status |
+| `POST` | `/admin/api/users` | `{username,password,role}` | Create user |
+| `DELETE` | `/admin/api/users/{user_id}` | path | Delete user |
+| `GET` | `/admin/api/tokens` | none | List API tokens |
+| `POST` | `/admin/api/tokens` | `{name, provider?}` | Create token |
+| `DELETE` | `/admin/api/tokens/{token_id}` | path | Delete token |
+| `POST` | `/admin/api/system/reload` | none | Reload config/system state |
+
+Create token example after logging in with a session cookie:
+
+```bash
+curl -s "$CODERAI_URL/admin/api/tokens" \
+  -b cookies.txt \
+  -H "Content-Type: application/json" \
+  -d '{"name":"automation","provider":"local"}' | jq
+```
+
+### Model and Cache Management
+
+| Method | Path | Body/Query | Purpose |
+|---|---|---|---|
+| `GET` | `/admin/api/models` | none | List configured models |
+| `POST` | `/admin/api/model-download` | `{model_id,file_pattern?}` | Start Hugging Face download |
+| `GET` | `/admin/api/download-stream/{session_id}` | path | SSE download progress |
+| `GET` | `/admin/api/downloads` | none | Active/recent downloads |
+| `POST` | `/admin/api/download-cancel/{session_id}` | path | Cancel download |
+| `POST` | `/admin/api/model-upload` | multipart chunk | Chunked model upload |
+| `DELETE` | `/admin/api/models/{model_identifier}` | path | Remove cached model |
+| `GET` | `/admin/api/hf-files` | `repo_id` | List HF repo files |
+| `GET` | `/admin/api/cached-models` | none | Local cache inventory |
+| `GET` | `/admin/api/cache-stats` | none | Disk/cache stats |
+| `DELETE` | `/admin/api/cache` | `cache_type=all\|hf\|gguf` | Clear cache |
+| `DELETE` | `/admin/api/cached-models/{model_id:path}` | `cache_type` | Delete cached model |
+| `POST` | `/admin/api/model-enable` | `{path\|model_id,model_type}` | Enable model in config |
+| `POST` | `/admin/api/model-disable` | `{path\|model_id,config_id?}` | Disable model |
+| `GET` | `/admin/api/model-loaded-status` | none | Loaded model / pool info |
+| `POST` | `/admin/api/model-load` | `{path}` | Load model now |
+| `POST` | `/admin/api/model-unload` | `{path}` | Unload model |
+| `POST` | `/admin/api/model-configure` | model config JSON | Configure model |
+
+Download with SSE progress:
+
+```bash
+SESSION_ID=$(curl -s "$CODERAI_URL/admin/api/model-download" \
+  -b cookies.txt \
+  -H "Content-Type: application/json" \
+  -d '{"model_id":"Qwen/Qwen3-8B"}' | jq -r .session_id)
+
+curl -N "$CODERAI_URL/admin/api/download-stream/$SESSION_ID" -b cookies.txt
+```
+
+SSE events include `progress`, `done`, `error`, and `keepalive`.
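
A client other than `curl -N` needs to split the stream into events. The sketch below is a small standard-library parser for the Server-Sent Events text format; `parse_sse` is a hypothetical helper, the sample lines are made up, and the `data` payload is kept as a raw string because its format is not documented here. Whether the server sends `keepalive` as a named event or as a comment line is also an assumption; comment lines are simply skipped.

```python
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from Server-Sent Events text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a heartbeat
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips a single leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)


# Made-up sample stream for illustration only.
sample = [
    "event: progress\n", "data: 42\n", "\n",
    ": ping\n",
    "event: done\n", "data: ok\n", "\n",
]
events = list(parse_sse(sample))
```

A caller would typically stop reading once it sees a `done` or `error` event.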
+
+### Settings and Archive Admin
+
+| Method | Path | Body/Query | Purpose |
+|---|---|---|---|
+| `GET` | `/admin/api/settings` | none | Current config sections |
+| `POST` | `/admin/api/settings` | partial settings JSON | Save settings |
+| `GET` | `/admin/api/archive` | `limit`, `offset` | List archive entries |
+| `GET` | `/admin/api/archive/{gen_id}` | path | Archive entry detail |
+| `DELETE` | `/admin/api/archive/{gen_id}` | path | Delete archive entry |
+| `GET` | `/admin/api/archive/{gen_id}/files/{filename}` | path | Download archive file |
+| `GET` | `/admin/api/archive-settings` | none | Archive config and retention options |
+
+Settings include the following sections: server, backend, model, offload, vulkan, archive, thermal, broker, parser, and system-prompt.
+
+### Hugging Face Search and Metadata
+
+| Method | Path | Query | Purpose |
+|---|---|---|---|
+| `GET` | `/admin/api/hf-search` | `q`, `gguf_mode`, `pipeline_tag`, `sort`, `sizes`, `arch`, `capabilities`, `component_type` | Search models |
+| `GET` | `/admin/api/hf-model-files` | `model_id` | List GGUF/model files with size/quant metadata |
+| `GET` | `/admin/api/hf-model-info` | `model_id` | Full HF model metadata summary |
+
+Example:
+
+```bash
+curl -s "$CODERAI_URL/admin/api/hf-search?q=whisper&capabilities=speech_to_text" \
+  -b cookies.txt | jq
+```
+
+### Admin Profile Proxies
+
+Logged-in users can access profile metadata through admin routes:
+
+| Method | Path | Purpose |
+|---|---|---|
+| `GET` | `/admin/api/characters` | List characters |
+| `GET` | `/admin/api/characters/{name}` | Character detail |
+| `GET` | `/admin/api/characters/{name}/thumbnail` | Character thumbnail |
+| `DELETE` | `/admin/api/characters/{name}` | Delete character |
+| `GET` | `/admin/api/environments` | List environments |
+| `GET` | `/admin/api/environments/{name}` | Environment detail |
+| `GET` | `/admin/api/environments/{name}/thumbnail` | Environment thumbnail |
+| `DELETE` | `/admin/api/environments/{name}` | Delete environment |
+| `GET` | `/admin/api/voices` | List voice profiles |
+| `GET` | `/admin/api/voices/{name}` | Voice detail |
+| `DELETE` | `/admin/api/voices/{name}` | Delete voice |
+
+## AISBF / Broker Integration
+
+CoderAI exposes:
+
+- `GET /coderai/capabilities`
+- OpenAI-compatible `/v1/models` and `/v1/chat/completions`
+- Native `/v1/*` endpoints that can be proxied by AISBF
+
+AISBF broker mode uses outbound WebSocket connections from CoderAI to AISBF for NAT traversal. The canonical broker protocol is documented in `coderai-broker-implementation-reference.md`.
+
+Global-scope broker URL template:
+
+```text
+wss://<aisbf-host>/api/coderai/wss?provider_id=<provider_id>&client_id=<client_id>&username=global&registration_token=<token>
+```
+
+User-scope broker URL template:
+
+```text
+wss://<aisbf-host>/api/u/<username>/coderai/wss?provider_id=<provider_id>&client_id=<client_id>&username=<username>&registration_token=<token>
+```
+
+Important broker fields:
+
+- `provider_id` identifies the AISBF provider configuration.
+- `client_id` must be stable and match the provider config.
+- `username` is `global` or the AISBF username for user-scoped providers.
+- `registration_token` is provider-scoped and required for admission.
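
The two URL templates and the fields above can be combined in a small builder. This is an illustrative sketch: `broker_url` is a hypothetical function, the host and identifiers are placeholders, and percent-encoding the query values is an assumption about what AISBF accepts.

```python
from urllib.parse import quote, urlencode


def broker_url(host, provider_id, client_id, registration_token, username="global"):
    """Build the broker WebSocket URL from the templates in this section."""
    if username == "global":
        path = "/api/coderai/wss"
    else:
        # User-scoped providers carry the username in the path as well.
        path = f"/api/u/{quote(username, safe='')}/coderai/wss"
    query = urlencode({
        "provider_id": provider_id,
        "client_id": client_id,
        "username": username,
        "registration_token": registration_token,
    })
    return f"wss://{host}{path}?{query}"


# Placeholder values for illustration only.
global_url = broker_url("aisbf.example", "p1", "c1", "tok")
user_url = broker_url("aisbf.example", "p1", "c1", "tok", username="alice")
```

Keeping `client_id` in configuration rather than generating it per run matches the requirement that it stay stable.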
+
+AISBF can call operations such as `models.list`, `chat.completions`, `capabilities`, `register`, and `proxy`. Proxy operations can forward headers, query params, multipart form payloads, binary/base64 bodies, progress polling endpoints, and streaming envelopes.
+
+## Error Handling
+
+Common HTTP status codes:
+
+| Status | Meaning |
+|---:|---|
+| `400` | Invalid request, missing required media, or incompatible fields |
+| `401` | Missing/invalid token or session |
+| `403` | Forbidden, unsafe file path, or insufficient role |
+| `404` | Model, profile, file, pipeline, or archive entry not found |
+| `422` | Validation error for strict fields |
+| `429` | Rate limit or queue saturation |
+| `500` | Generation/backend failure |
+| `501` | Optional backend not installed |
+| `503` | Model/backend unavailable or CUDA context poisoned |
+
+Typical auth error:
+
+```json
+{
+  "detail": {
+    "message": "Invalid API key. Provide a valid Bearer token.",
+    "type": "invalid_request_error",
+    "code": "invalid_api_key"
+  }
+}
+```
+
+If a CUDA device-side assert or illegal memory access poisons the context, CoderAI fails fast with a `503` whose message states that the process must be restarted.
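
A client can turn the status table and the `detail` object above into one small decision helper. The sketch below is illustrative only: `summarize_error` is a hypothetical name, and treating `429` as the only retryable status is an assumption of this example rather than documented server behaviour.

```python
import json

# Assumption of this sketch: only 429 (rate limit / queue saturation) is worth
# retrying automatically. A 503 may mean a poisoned CUDA context, which the
# text says needs a process restart, so blind retries would not help.
RETRYABLE = {429}


def summarize_error(status, body_text):
    """Return (message, code, retryable) for an error response body."""
    try:
        detail = json.loads(body_text).get("detail")
    except (ValueError, AttributeError):
        detail = None  # body was not JSON, or not a JSON object
    if isinstance(detail, dict):
        message = detail.get("message", "")
        code = detail.get("code")
    else:
        # Fall back to a plain-string detail, or to the raw body text.
        message = detail if isinstance(detail, str) else body_text
        code = None
    return message, code, status in RETRYABLE


body = '{"detail": {"message": "Invalid API key. Provide a valid Bearer token.", "code": "invalid_api_key"}}'
print(summarize_error(401, body))
```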
+
+## Complex Workflows
+
+### Workflow 1: Consistent Character Image and Video
+
+Goal: create a character, generate a scene image using that identity, then animate it.
+
+1. Create character profile:
+
+```bash
+curl -s "$CODERAI_URL/v1/characters" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name":"Alice",
+    "description":"Detective with short black hair and charcoal coat",
+    "images":[{"label":"front","data":"data:image/png;base64,..."}]
+  }'
+```
+
+2. Generate an image with the profile:
+
+```bash
+IMAGE_URL=$(curl -s "$CODERAI_URL/v1/images/generations" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model":"image-model",
+    "prompt":"Alice in a rainy neon alley, cinematic detective noir",
+    "character_profiles":["Alice"],
+    "character_strength":0.75,
+    "size":"1024x1024",
+    "response_format":"url"
+  }' | jq -r '.data[0].url')
+```
+
+3. Animate the image:
+
+```bash
+curl -s "$CODERAI_URL/v1/video/generations" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d "{
+    \"model\":\"video-model\",
+    \"mode\":\"i2v\",
+    \"prompt\":\"Alice looks up as rain falls, subtle camera push-in\",
+    \"init_image\":\"$IMAGE_URL\",
+    \"num_frames\":32,
+    \"fps\":8,
+    \"camera_motion\":\"zoom-in\",
+    \"response_format\":\"url\"
+  }" | jq
+```
+
+### Workflow 2: Full Story Generation
+
+Use the built-in story pipeline to generate a script, scene images, a short video, and narration.
+
+```bash
+curl -s "$CODERAI_URL/v1/pipelines/story" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "story":"A botanist finds a singing plant inside a crashed satellite.",
+    "text_model":"Qwen/Qwen3-8B",
+    "image_model":"image-model",
+    "video_model":"video-model",
+    "tts_model":"kokoro",
+    "tts_voice":"af_sarah",
+    "num_scenes":4,
+    "num_frames":32,
+    "fps":8,
+    "response_format":"url"
+  }' | jq
+```
+
+The response includes:
+
+- `steps[0].text`: the generated scene script
+- `steps[1].urls`: the generated scene images
+- `data[0].video_url`: the generated video
+- `data[0].audio_url`: the narration audio
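
A client might gather those fields as follows. This is a hedged sketch: it assumes only the four paths listed above, the helper name `collect_story_outputs` is hypothetical, and any other keys in the real response are ignored.

```python
def collect_story_outputs(result):
    """Pull the script, image URLs, video URL and audio URL out of a story response."""
    steps = result.get("steps", [])
    data = result.get("data", [])
    first = data[0] if data else {}
    return {
        "script": steps[0].get("text") if len(steps) > 0 else None,
        "image_urls": steps[1].get("urls", []) if len(steps) > 1 else [],
        "video_url": first.get("video_url"),
        "audio_url": first.get("audio_url"),
    }


# Hand-written sample shaped like the list above (not real server output).
sample = {
    "steps": [{"text": "Scene 1: ..."}, {"urls": ["http://host/a.png"]}],
    "data": [{"video_url": "http://host/v.mp4", "audio_url": "http://host/n.wav"}],
}
print(collect_story_outputs(sample))
```

Missing steps or an empty `data` list yield `None` or an empty list instead of raising, which is convenient when a step fails part-way through.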
+
+### Workflow 3: Multilingual Video Dubbing
+
+1. Upload the source video, or encode it as a data URL.
+2. Call the video dub pipeline.
+3. Poll `GET /v1/video/progress` while the job runs, if needed.
+4. Download the output from the returned URL.
+
+```bash
+curl -s "$CODERAI_URL/v1/pipelines/video-dub" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model":"whisper-large-v3",
+    "video":"data:video/mp4;base64,...",
+    "source_lang":"en",
+    "target_lang":"ja",
+    "voice_clone":true,
+    "burn_subtitles":true,
+    "response_format":"url"
+  }' | jq
+```
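
Step 1 (encoding the source video as a data URL) can be done with the Python standard library. The helper name `file_to_data_url` is hypothetical; the sketch assumes only the `data:<mime>;base64,<payload>` shape used in the request above.

```python
import base64
import mimetypes
from pathlib import Path


def file_to_data_url(path, default_mime="application/octet-stream"):
    """Read a file and return it as a base64 data URL."""
    path = Path(path)
    # Guess the MIME type from the file extension, e.g. .mp4 -> video/mp4.
    mime, _ = mimetypes.guess_type(path.name)
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime or default_mime};base64,{encoded}"
```

The returned string can be passed as the `video` field. Base64 grows the payload by about one third and this helper reads the whole file into memory, so it suits short clips best.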
+
+For lower-level control, use:
+
+- `POST /v1/video/subtitle`
+- `POST /v1/audio/clone`
+- `POST /v1/video/dub`
+
+### Workflow 4: Audio Meeting Summary
+
+Transcribe a meeting and summarize action items with a text model.
+
+```bash
+curl -s "$CODERAI_URL/v1/pipelines/audio-understand" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "audio":"data:audio/wav;base64,...",
+    "audio_model":"whisper-large-v3",
+    "text_model":"Qwen/Qwen3-8B",
+    "language":"en",
+    "input":"Extract decisions, owners, deadlines, and unresolved questions."
+  }' | jq
+```
+
+### Workflow 5: Train and Apply a Character LoRA
+
+1. Build a character profile:
+
+```json
+{
+  "name": "Mira",
+  "description": "Explorer with copper curls and a green field jacket",
+  "images": [{"label": "front", "data": "data:image/png;base64,..."}]
+}
+```
+
+2. Train LoRA:
+
+```bash
+curl -s "$CODERAI_URL/v1/loras/train" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name":"mira_lora",
+    "base_model":"image-model",
+    "target":"image",
+    "character":"Mira",
+    "instance_prompt":"photo of mira_person",
+    "steps":800,
+    "rank":16,
+    "resolution":768
+  }' | jq
+```
+
+3. Poll progress:
+
+```bash
+watch -n 2 "curl -s '$CODERAI_URL/v1/loras/progress' -H 'Authorization: Bearer $CODERAI_TOKEN' | jq"
+```
+
+4. Generate with LoRA:
+
+```json
+{
+  "model": "image-model",
+  "prompt": "photo of mira_person exploring alien ruins, cinematic backlight",
+  "loras": [{"model": "mira_lora", "weight": 0.8}],
+  "response_format": "url"
+}
+```
+
+### Workflow 6: Custom Pipeline for Automated Media Asset Creation
+
+Create a reusable pipeline that converts a product idea into a slogan, hero image, promo video, and voiceover.
+
+```bash
+curl -s "$CODERAI_URL/v1/pipelines/custom" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "id":"product-media-kit",
+    "name":"Product Media Kit",
+    "description":"Slogan, image, video, and voiceover for a product concept.",
+    "steps":[
+      {
+        "type":"text_gen",
+        "label":"Write slogan and image prompt",
+        "params":{
+          "model":"Qwen/Qwen3-8B",
+          "system":"Return a concise slogan, then a vivid image prompt.",
+          "prompt":"Product concept: {{input}}"
+        }
+      },
+      {
+        "type":"image_gen",
+        "label":"Hero image",
+        "params":{
+          "model":"image-model",
+          "prompt":"{{step0.output}}",
+          "size":"1024x1024",
+          "response_format":"url"
+        }
+      },
+      {
+        "type":"video_gen",
+        "label":"Promo animation",
+        "params":{
+          "model":"video-model",
+          "mode":"i2v",
+          "prompt":"premium product commercial, elegant camera motion, {{step0.output}}",
+          "init_image":"{{step1.url}}",
+          "num_frames":32,
+          "fps":8,
+          "response_format":"url"
+        }
+      },
+      {
+        "type":"tts",
+        "label":"Voiceover",
+        "params":{
+          "model":"kokoro",
+          "input":"{{step0.output}}",
+          "voice":"af_sarah",
+          "speed":1.0
+        },
+        "continue_on_error":true
+      }
+    ]
+  }' | jq
+
+curl -s "$CODERAI_URL/v1/pipelines/custom/product-media-kit/run" \
+  -H "Authorization: Bearer $CODERAI_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"input":"A compact solar charger for hikers and emergency kits"}' | jq
+```
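
The pipeline definition above wires steps together with placeholders such as `{{input}}`, `{{step0.output}}`, and `{{step1.url}}`. The sketch below shows one way such placeholders could be resolved. It is purely illustrative and is not CoderAI's implementation: `render_template` and the `step_results` list of dicts are assumptions made for this example.

```python
import re

# Matches {{input}} and {{stepN.field}} style placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z0-9_]+)(?:\.([A-Za-z0-9_]+))?\s*\}\}")


def render_template(text, user_input, step_results):
    """Replace {{input}} and {{stepN.field}} using earlier step results."""
    def substitute(match):
        name, field = match.group(1), match.group(2)
        if name == "input" and field is None:
            return str(user_input)
        if name.startswith("step") and name[4:].isdigit() and field:
            index = int(name[4:])
            if index < len(step_results) and field in step_results[index]:
                return str(step_results[index][field])
        return match.group(0)  # leave unknown placeholders untouched

    return PLACEHOLDER.sub(substitute, text)


results = [{"output": "Power anywhere."}, {"url": "http://host/hero.png"}]
print(render_template("Product concept: {{input}}", "solar charger", results))
print(render_template("{{step1.url}}", "", results))
```

Leaving unresolved placeholders in place makes a mis-numbered step reference visible in the rendered prompt instead of silently becoming an empty string.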
+
+## Practical Client Patterns
+
+### Polling Progress While a Job Runs
+
+Use a second terminal while a generation request is running:
+
+```bash
+while true; do
+  curl -s "$CODERAI_URL/v1/video/progress" \
+    -H "Authorization: Bearer $CODERAI_TOKEN" | jq -c
+  sleep 2
+done
+```
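
The same loop can be written in Python with a poll limit so that it cannot run forever. This is a sketch under stated assumptions: the progress payload format is not described in this section, so the caller supplies both `fetch` and the `is_done` predicate, and the name `poll_until` is hypothetical.

```python
import time


def poll_until(fetch, is_done, interval=2.0, max_polls=150, sleep=time.sleep):
    """Call fetch() until is_done(result) is true; give up after max_polls calls."""
    last = None
    for _ in range(max_polls):
        last = fetch()
        if is_done(last):
            return last
        sleep(interval)
    raise TimeoutError(f"not finished after {max_polls} polls; last={last!r}")


# Offline demonstration with a fake progress source; the "done" key is made up.
states = iter([{"done": False}, {"done": False}, {"done": True}])
print(poll_until(lambda: next(states), lambda s: s["done"], sleep=lambda _: None))
```

In real use, `fetch` would wrap a `requests.get(f"{base}/v1/video/progress", headers=...)` call and return its parsed JSON.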
+
+### Python Chat Client
+
+```python
+import requests
+
+base = "http://127.0.0.1:8776"
+token = "your-api-token"
+
+resp = requests.post(
+    f"{base}/v1/chat/completions",
+    headers={"Authorization": f"Bearer {token}"},
+    json={
+        "model": "Qwen/Qwen3-8B",
+        "messages": [{"role": "user", "content": "Write a CLI release note."}],
+        "temperature": 0.3,
+    },
+    timeout=300,
+)
+resp.raise_for_status()
+print(resp.json()["choices"][0]["message"]["content"])
+```
+
+### Python Streaming Chat Client
+
+```python
+import json
+import requests
+
+base = "http://127.0.0.1:8776"
+token = "your-api-token"
+
+with requests.post(
+    f"{base}/v1/chat/completions",
+    headers={"Authorization": f"Bearer {token}"},
+    json={
+        "model": "Qwen/Qwen3-8B",
+        "messages": [{"role": "user", "content": "Count to five slowly."}],
+        "stream": True,
+    },
+    stream=True,
+    timeout=300,
+) as r:
+    r.raise_for_status()
+    for line in r.iter_lines(decode_unicode=True):
+        if not line or not line.startswith("data: "):
+            continue
+        payload = line[6:]
+        if payload == "[DONE]":
+            break
+        event = json.loads(payload)
+        delta = event["choices"][0].get("delta", {})
+        print(delta.get("content") or "", end="", flush=True)
+```
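
The line-parsing part of the streaming client can be factored into a pure function, which makes it testable without a running server. This is a minimal sketch assuming the same `data: ...` / `[DONE]` framing as the client above; `iter_stream_text` is a hypothetical helper name.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style streaming lines."""
    for line in lines:
        if not line or not line.startswith("data: "):
            continue  # skip blank keep-alives and non-data lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue
        # "content" may be missing or null, for example in role-only chunks.
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text


sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"one"}}]}',
    'data: {"choices":[{"delta":{"content":" two"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

With `requests`, the generator can be fed directly from `r.iter_lines(decode_unicode=True)`.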
+
+### OpenAI Python SDK Compatibility
+
+The OpenAI Python SDK works against the OpenAI-compatible text routes when `base_url` points at CoderAI:
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://127.0.0.1:8776/v1",
+    api_key="your-api-token",
+)
+
+response = client.chat.completions.create(
+    model="Qwen/Qwen3-8B",
+    messages=[{"role": "user", "content": "Explain local model routing."}],
+)
+print(response.choices[0].message.content)
+```
+
+## Endpoint Index
+
+### Public `/v1` and Discovery
+
+| Method | Path |
+|---|---|
+| `GET` | `/v1/models` |
+| `GET` | `/coderai/capabilities` |
+| `GET` | `/v1/files/{filename}` |
+| `GET` | `/v1/archive` |
+| `DELETE` | `/v1/archive/{filename}` |
+| `POST` | `/v1/chat/completions` |
+| `POST` | `/v1/completions` |
+| `GET` | `/v1/images/progress` |
+| `POST` | `/v1/images/generations` |
+| `POST` | `/v1/images/edits` |
+| `POST` | `/v1/images/inpaint` |
+| `POST` | `/v1/images/upscale` |
+| `POST` | `/v1/images/depth` |
+| `POST` | `/v1/images/segment` |
+| `POST` | `/v1/images/deblur` |
+| `POST` | `/v1/images/unpixelate` |
+| `POST` | `/v1/images/outfit` |
+| `POST` | `/v1/images/faceswap` |
+| `GET` | `/v1/video/progress` |
+| `POST` | `/v1/video/generations` |
+| `POST` | `/v1/video/upscale` |
+| `POST` | `/v1/video/subtitle` |
+| `POST` | `/v1/video/interpolate` |
+| `POST` | `/v1/video/dub` |
+| `POST` | `/v1/audio/transcriptions` |
+| `POST` | `/v1/audio/speech` |
+| `GET` | `/v1/audio/progress` |
+| `POST` | `/v1/audio/generate` |
+| `GET` | `/v1/audio/voices` |
+| `POST` | `/v1/audio/voices` |
+| `GET` | `/v1/audio/voices/{name}` |
+| `PATCH` | `/v1/audio/voices/{name}` |
+| `DELETE` | `/v1/audio/voices/{name}` |
+| `POST` | `/v1/audio/voices/extract` |
+| `POST` | `/v1/audio/clone` |
+| `POST` | `/v1/audio/convert` |
+| `POST` | `/v1/audio/stems` |
+| `POST` | `/v1/audio/cleanup` |
+| `POST` | `/v1/embeddings` |
+| `POST` | `/v1/characters` |
+| `GET` | `/v1/characters` |
+| `GET` | `/v1/characters/{name}` |
+| `PATCH` | `/v1/characters/{name}` |
+| `DELETE` | `/v1/characters/{name}` |
+| `POST` | `/v1/characters/generate` |
+| `POST` | `/v1/characters/extract` |
+| `POST` | `/v1/environments` |
+| `GET` | `/v1/environments` |
+| `GET` | `/v1/environments/{name}` |
+| `PATCH` | `/v1/environments/{name}` |
+| `DELETE` | `/v1/environments/{name}` |
+| `POST` | `/v1/environments/generate` |
+| `POST` | `/v1/environments/extract` |
+| `POST` | `/v1/loras/train` |
+| `GET` | `/v1/loras/progress` |
+| `GET` | `/v1/loras` |
+| `GET` | `/v1/loras/{name}` |
+| `DELETE` | `/v1/loras/{name}` |
+| `POST` | `/v1/images/to3d` |
+| `POST` | `/v1/images/from3d` |
+| `POST` | `/v1/video/to3d` |
+| `POST` | `/v1/video/from3d` |
+| `POST` | `/v1/3d/generate` |
+| `POST` | `/v1/pipelines/image-to-video` |
+| `POST` | `/v1/pipelines/video-dub` |
+| `POST` | `/v1/pipelines/story` |
+| `POST` | `/v1/pipelines/audio-dub` |
+| `GET` | `/v1/pipelines/custom` |
+| `GET` | `/v1/pipelines/step-types` |
+| `POST` | `/v1/pipelines/custom` |
+| `PUT` | `/v1/pipelines/custom/{pipeline_id}` |
+| `DELETE` | `/v1/pipelines/custom/{pipeline_id}` |
+| `POST` | `/v1/pipelines/custom/{pipeline_id}/run` |
+| `POST` | `/v1/pipelines/run` |
+| `POST` | `/v1/pipelines/audio-understand` |
+| `POST` | `/v1/pipelines/audio-music-dub` |
+
+### Admin API
+
+| Method | Path |
+|---|---|
+| `GET` | `/admin/api/status` |
+| `POST` | `/admin/api/users` |
+| `DELETE` | `/admin/api/users/{user_id}` |
+| `GET` | `/admin/api/tokens` |
+| `POST` | `/admin/api/tokens` |
+| `DELETE` | `/admin/api/tokens/{token_id}` |
+| `GET` | `/admin/api/models` |
+| `POST` | `/admin/api/model-download` |
+| `GET` | `/admin/api/download-stream/{session_id}` |
+| `GET` | `/admin/api/downloads` |
+| `POST` | `/admin/api/download-cancel/{session_id}` |
+| `POST` | `/admin/api/model-upload` |
+| `DELETE` | `/admin/api/models/{model_identifier}` |
+| `GET` | `/admin/api/hf-files` |
+| `GET` | `/admin/api/cached-models` |
+| `GET` | `/admin/api/cache-stats` |
+| `DELETE` | `/admin/api/cache` |
+| `DELETE` | `/admin/api/cached-models/{model_id:path}` |
+| `POST` | `/admin/api/model-enable` |
+| `POST` | `/admin/api/model-disable` |
+| `GET` | `/admin/api/model-loaded-status` |
+| `POST` | `/admin/api/model-load` |
+| `POST` | `/admin/api/model-unload` |
+| `POST` | `/admin/api/model-configure` |
+| `POST` | `/admin/api/system/reload` |
+| `GET` | `/admin/api/settings` |
+| `POST` | `/admin/api/settings` |
+| `GET` | `/admin/api/archive` |
+| `GET` | `/admin/api/archive/{gen_id}` |
+| `DELETE` | `/admin/api/archive/{gen_id}` |
+| `GET` | `/admin/api/archive/{gen_id}/files/{filename}` |
+| `GET` | `/admin/api/archive-settings` |
+| `GET` | `/admin/api/hf-search` |
+| `GET` | `/admin/api/hf-model-files` |
+| `GET` | `/admin/api/hf-model-info` |
+| `GET` | `/admin/api/characters` |
+| `GET` | `/admin/api/characters/{name}` |
+| `GET` | `/admin/api/characters/{name}/thumbnail` |
+| `DELETE` | `/admin/api/characters/{name}` |
+| `GET` | `/admin/api/environments` |
+| `GET` | `/admin/api/environments/{name}` |
+| `GET` | `/admin/api/environments/{name}/thumbnail` |
+| `DELETE` | `/admin/api/environments/{name}` |
+| `GET` | `/admin/api/voices` |
+| `GET` | `/admin/api/voices/{name}` |
+| `DELETE` | `/admin/api/voices/{name}` |
--- a/build.sh
+++ b/build.sh
@@ -173,6 +173,16 @@ if [ "$BACKEND" = "nvidia" ]; then
        echo -e "${YELLOW}Note: audiocraft not installed (audio generation with MusicGen optional)${NC}"
    }

+    # Optional quantization backends for diffusers image/video pipelines:
+    #   optimum-quanto -> enables 2-bit (int2) per-component quantization
+    #   gguf           -> enables loading GGUF-quantized components (Q5_K/Q6_K, etc.)
+    # bitsandbytes (4-bit/8-bit) comes via requirements-nvidia.txt; these add the
+    # extra widths that bitsandbytes cannot do.
+    echo -e "${YELLOW}Installing optional quantization backends (2-bit / GGUF)...${NC}"
+    pip install optimum-quanto gguf || {
+        echo -e "${YELLOW}Note: optimum-quanto/gguf not installed (2-bit and GGUF 5/6-bit quantization optional)${NC}"
+    }
+
    # Install Flash Attention 2 if requested
    if [ "$FLASH" = true ]; then
        echo ""

--- a/codai/__init__.py
+++ b/codai/__init__.py
@@ -14,6 +14,63 @@
 # You should have received a copy of the GNU General Public License
 # along with this program. If not, see <https://www.gnu.org/licenses/>.

+# Configure the CUDA caching allocator BEFORE torch is imported anywhere.
+# expandable_segments lets the allocator return freed pages to the driver even
+# from partially-used segments.  Without it, a single small live tensor (e.g. a
+# tied embedding weight) pins an entire large segment, so torch.cuda.empty_cache()
+# cannot release the GBs of already-freed weights around it after a model is
+# evicted — VRAM stays occupied and the next model can't load.  Honour any value
+# the user already set.
+import os as _os
+_alloc_conf = _os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
+if "expandable_segments" not in _alloc_conf:
+    _os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
+        (_alloc_conf + ",") if _alloc_conf else ""
+    ) + "expandable_segments:True"
+
+# Cap CPU threads BEFORE torch / OpenMP / MKL initialise.  Loading and 4-bit
+# dequantising large models is CPU-heavy; left uncapped, torch/OpenMP grab every
+# core, the load average spikes, and the machine becomes sluggish.  On boxes
+# with >= 8 cores, limit to HALF the cores so model loads never saturate the
+# machine.  Smaller machines keep the default (don't cripple them).  Honour any
+# value the user already set.
+try:
+    _ncpu = _os.cpu_count() or 0
+    if _ncpu >= 8:
+        _cap = str(max(1, _ncpu // 2))
+        for _var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
+                     "NUMEXPR_NUM_THREADS", "VECLIB_MAXIMUM_THREADS"):
+            _os.environ.setdefault(_var, _cap)
+except Exception:
+    pass
+
+# Silence ONE specific upstream FutureWarning from bitsandbytes' quant kernels:
+#   bitsandbytes/backends/cuda/ops.py: torch._check_is_size(blocksize)
+# bitsandbytes (latest, 0.49.2) still calls the deprecated torch._check_is_size
+# on bleeding-edge torch.  We don't call it ourselves and can't fix their source,
+# so suppress just this message (not warnings in general) to keep logs readable.
+import warnings as _warnings
+_warnings.filterwarnings(
+    "ignore",
+    message=r".*_check_is_size will be removed.*",
+    category=FutureWarning,
+)
+# More upstream / diagnostic-only noise we can't fix from here:
+#   - huggingface_hub: diffusers/transformers pass the deprecated
+#     `local_dir_use_symlinks` kwarg to hf_hub_download (not our code).
+#   - torch.distributed.reduce_op: emitted while the debug leak-scanner walks
+#     gc.get_objects(); unavoidable without dropping the scan.
+_warnings.filterwarnings(
+    "ignore",
+    message=r".*local_dir_use_symlinks.*",
+    category=UserWarning,
+)
+_warnings.filterwarnings(
+    "ignore",
+    message=r".*reduce_op.*is deprecated.*",
+    category=FutureWarning,
+)
+
 # codai module - AI model parsing utilities
 from .models.parser import (
    ModelParserDispatcher,

--- a/codai/admin/templates/chat.html
+++ b/codai/admin/templates/chat.html
@@ -15,8 +15,9 @@
 .sidebar {
  width:220px; min-width:180px; background:var(--surface-1);
  border-right:1px solid var(--border); display:flex; flex-direction:column;
-  overflow:hidden; flex-shrink:0;
+  overflow:hidden; flex-shrink:0; transition:width .15s, min-width .15s;
 }
+.sidebar.hidden { width:0; min-width:0; border-right:none; overflow:hidden; }
 .sidebar-hd { padding:.6rem 1rem .15rem; font-size:10px; font-weight:700;
  color:var(--text-3); letter-spacing:.07em; text-transform:uppercase; }
 .model-list { flex:1; overflow-y:auto; padding:.2rem .4rem .5rem; }
@@ -242,6 +243,52 @@ a.dl { display:inline-block; margin-top:.4rem; }
 .req-preview-actions { display:flex; gap:.4rem; flex-wrap:wrap; align-items:center; }
 .req-preview-status { font-size:11px; color:var(--text-3); min-height:14px; }

+/* ── Model pick block (Studio per-panel selectors) ───────────── */
+.model-pick-block {
+  background:var(--surface-2); border:1px solid var(--border); border-radius:8px;
+  padding:.6rem .75rem; display:flex; flex-direction:column; gap:.4rem; margin-bottom:.6rem;
+}
+.model-pick-title {
+  font-size:10px; font-weight:700; letter-spacing:.07em; text-transform:uppercase;
+  color:var(--text-3); margin-bottom:.1rem;
+}
+.model-pick-row { display:flex; align-items:center; gap:.5rem; }
+.model-pick-role {
+  font-size:11px; color:var(--text-2); min-width:7.5rem; flex-shrink:0; line-height:1.3;
+}
+.model-pick-sel {
+  flex:1; padding:.35rem .5rem; border:1px solid var(--border); border-radius:6px;
+  background:var(--surface-1); color:var(--text-1); font-size:12px; cursor:pointer;
+  min-width:0;
+}
+.model-pick-sel:focus { outline:2px solid var(--accent); outline-offset:1px; }
+.model-pick-hint {
+  font-size:10px; color:var(--text-3); display:flex; align-items:center; gap:.55rem; flex-wrap:wrap;
+}
+.mp-ok { color:#4ade80; }
+.mp-warn { color:#f0c060; }
+/* VAE / LoRA optional section */
+.mp-extra { border-top:1px solid var(--border); margin-top:.3rem; padding-top:.4rem; }
+.mp-extra summary {
+  font-size:11px; color:var(--text-2); cursor:pointer; user-select:none;
+  list-style:none; display:flex; align-items:center; gap:.3rem;
+}
+.mp-extra summary::-webkit-details-marker { display:none; }
+.mp-extra summary::before { content:'▶'; font-size:9px; transition:transform .15s; }
+.mp-extra[open] summary::before { transform:rotate(90deg); }
+.mp-extra-body { display:flex; flex-direction:column; gap:.4rem; margin-top:.45rem; }
+.lora-entry { display:flex; align-items:center; gap:.35rem; }
+.lora-weight {
+  width:4.5rem; flex-shrink:0; padding:.3rem .4rem; border:1px solid var(--border);
+  border-radius:5px; background:var(--surface-1); color:var(--text-1); font-size:12px;
+}
+.lora-remove {
+  width:1.5rem; height:1.5rem; flex-shrink:0; display:flex; align-items:center; justify-content:center;
+  border:none; background:transparent; color:var(--text-3); cursor:pointer; font-size:13px;
+  line-height:1; padding:0; border-radius:4px;
+}
+.lora-remove:hover { background:var(--surface-3); color:var(--text-1); }
+
 /* ── Diagnostics / history ────────────────────────────────────── */
 .diag-card, .hist-card {
  border:1px solid var(--border); background:var(--surface-1); border-radius:8px;
@@ -358,6 +405,19 @@ a.dl { display:inline-block; margin-top:.4rem; }
 .prof-voice-actions { display:flex; gap:.4rem; margin-top:.5rem; }
 /* ── Role-picker popup ────────────────────────────────────────── */
 .role-picker-popup { position:fixed; z-index:9999; background:var(--surface-1); border:1px solid var(--border); border-radius:8px; padding:.7rem; box-shadow:0 8px 24px rgba(0,0,0,.5); min-width:200px; max-width:280px; }
+/* ── Profile viewer modal ─────────────────────────────────────── */
+.prof-modal-backdrop { position:fixed; inset:0; background:rgba(0,0,0,.6); z-index:10000; display:flex; align-items:center; justify-content:center; }
+.prof-modal { background:var(--surface-1); border:1px solid var(--border); border-radius:10px; box-shadow:0 12px 40px rgba(0,0,0,.6); width:min(700px,95vw); max-height:85vh; display:flex; flex-direction:column; overflow:hidden; }
+.prof-modal-hd { display:flex; align-items:center; gap:.6rem; padding:.8rem 1rem; border-bottom:1px solid var(--border); flex-shrink:0; }
+.prof-modal-hd h3 { margin:0; font-size:15px; flex:1; color:var(--text-1); }
+.prof-modal-body { padding:1rem; overflow-y:auto; flex:1; }
+.prof-modal-desc { font-size:13px; color:var(--text-2); margin-bottom:.8rem; }
+.prof-modal-imgs { display:flex; flex-wrap:wrap; gap:.5rem; }
+.prof-modal-imgs img { height:140px; width:140px; object-fit:cover; border-radius:6px; cursor:pointer; border:2px solid transparent; transition:border-color .15s; }
+.prof-modal-imgs img:hover { border-color:var(--accent,#4f8ef7); }
+.prof-modal-empty { color:var(--text-3); font-size:13px; font-style:italic; }
+.prof-lightbox { position:fixed; inset:0; background:rgba(0,0,0,.88); z-index:11000; display:flex; align-items:center; justify-content:center; cursor:zoom-out; }
+.prof-lightbox img { max-width:90vw; max-height:90vh; object-fit:contain; border-radius:6px; box-shadow:0 8px 40px rgba(0,0,0,.8); }
 .role-picker-header { font-size:12px; color:var(--text-2); margin-bottom:.5rem; }
 .role-picker-caps { display:flex; flex-direction:column; gap:.3rem; }
 .role-pick-btn { background:var(--surface-2); border:1px solid var(--border); border-radius:5px; color:var(--text-1); padding:.35rem .6rem; font-size:12px; cursor:pointer; font-family:inherit; text-align:left; display:flex; align-items:center; justify-content:space-between; gap:.4rem; }
@@ -2635,6 +2695,188 @@ function getCapabilityDetails(sub) {
  };
 }

+// ── Model pick block state ───────────────────────────────────────────────
+let _mpVae = {};    // { sub: string }  — VAE override per sub
+let _mpLoras = {};  // { sub: [{model, weight, name}] }
+
+const _MP_ROLE_LABELS = {
+  image_generation:'Image generation', image_to_image:'Image editing',
+  inpainting:'Inpainting', image_upscaling:'Upscaler', depth_estimation:'Depth estimator',
+  image_segmentation:'Segmentation', video_generation:'Video generation',
+  image_to_video:'Image → video', video_to_video:'Video editing',
+  video_interpolation:'Frame interpolation', video_upscaling:'Video upscaler',
+  subtitle_generation:'Subtitles', speech_to_text:'Transcription',
+  text_to_speech:'Voice synthesis', audio_generation:'Music / SFX',
+  audio_to_audio:'Voice conversion', embeddings:'Embedding model',
+  text_generation:'Language model', image_to_text:'Vision model',
+};
+
+const _IMAGE_SUBS = new Set(['img-gen','img-edit','img-inpaint','img-upscale','img-depth',
+  'img-seg','img-outfit','img-faceswap','img-deblur','img-unpix','img-to3d','img-from3d']);
+const _VIDEO_SUBS_VL = new Set(['vid-t2v','vid-i2v','vid-v2v','vid-ti2v']);
+
+function _mpBuildOpts(sub, cap) {
+  const assigned = capModelAssignments[sub]?.[cap] || activeModel?.id || '';
+  const capable = modelsForCap(cap);
+  if (!capable.length) {
+    return `<option value="">— no compatible model configured —</option>`;
+  }
+  let opts = `<option value="">— select —</option>`;
+  capable.forEach(m => {
+    const lbl = m.id.split('/').pop() + (m.load_mode === 'load' ? ' ●' : '');
+    opts += `<option value="${escapeHtml(m.id)}"${m.id === assigned ? ' selected' : ''}>${escapeHtml(lbl)}</option>`;
+  });
+  return opts;
+}
+
+function _mpBuildComponentOpts(selected, pattern) {
+  const list = models.filter(m => pattern.test(m.id));
+  if (!list.length) return null;
+  let opts = `<option value="">— none —</option>`;
+  list.forEach(m => {
+    const lbl = m.id.split('/').pop() + (m.load_mode === 'load' ? ' ●' : '');
+    opts += `<option value="${escapeHtml(m.id)}"${m.id === selected ? ' selected' : ''}>${escapeHtml(lbl)}</option>`;
+  });
+  return opts;
+}
+
+function _mpLoraEntry(sub, i, lora) {
+  const opts = _mpBuildComponentOpts(lora.model || '', /lora/i);
+  const sel = opts
+    ? `<select class="model-pick-sel" onchange="_mpLoraChange('${sub}',${i},'model',this.value)">${opts}</select>`
+    : `<span class="model-pick-sel" style="color:var(--text-3);font-size:11px;display:flex;align-items:center">No LoRA models configured</span>`;
+  return `<div class="lora-entry" id="mp-lora-${sub}-${i}">
+    ${sel}
+    ${opts ? `<input type="number" class="lora-weight" value="${lora.weight??1}" step="0.05" min="0" max="2"
+      title="Weight" onchange="_mpLoraChange('${sub}',${i},'weight',parseFloat(this.value))">` : ''}
+    <button class="lora-remove" onclick="_mpLoraRemove('${sub}',${i})" title="Remove">✕</button>
+  </div>`;
+}
+
+function renderModelPickBlock(sub) {
+  const studioRule = STUDIO_CAPABILITIES[sub];
+  const subRule    = SUB_CAPABILITY_RULES[sub];
+  const isMulti    = studioRule && (studioRule.requires || []).length > 1;
+  const showVaeLora = _IMAGE_SUBS.has(sub) || _VIDEO_SUBS_VL.has(sub);
+
+  // Build the list of caps to show selectors for
+  let caps = [];
+  if (studioRule) {
+    const req = (studioRule.requires || []).filter(c => modelsForCap(c).length > 0);
+    const opt = (studioRule.optional || []).filter(c => modelsForCap(c).length > 0);
+    caps = req.map(c=>({cap:c,required:true})).concat(opt.map(c=>({cap:c,required:false})));
+  } else {
+    const primary = SUB_API_CAP[sub] || (subRule?.requiresAny||subRule?.optional||[])[0];
+    if (primary) caps = [{cap:primary, required:true}];
+  }
+
+  let rows = '';
+  if (caps.length === 0) {
+    rows = `<div class="model-pick-row" style="font-size:12px;color:var(--text-3)">Auto — no explicit capability required.</div>`;
+  } else {
+    rows = caps.map(({cap, required}) => {
+      const roleLabel = _MP_ROLE_LABELS[cap] || cap.replace(/_/g,' ');
+      const opts = _mpBuildOpts(sub, cap);
+      const selId = `mpsel-${sub.replace(/-/g,'_')}-${cap}`;
+      const optLabel = !required && isMulti ? ` <em style="font-weight:400;opacity:.65">(opt)</em>` : '';
+      return `<div class="model-pick-row">
+        ${isMulti || caps.length > 1 ? `<span class="model-pick-role">${escapeHtml(roleLabel)}${optLabel}</span>` : ''}
+        <select class="model-pick-sel" id="${selId}" onchange="_mpSelChange('${sub}','${cap}',this.value)">
+          ${opts}
+        </select>
+      </div>`;
+    }).join('');
+  }
+
+  // VAE / LoRA section
+  let vaeLoraSec = '';
+  if (showVaeLora) {
+    const vaeOpts = _mpBuildComponentOpts(_mpVae[sub] || '', /vae/i);
+    const hasLora = models.some(m => /lora/i.test(m.id));
+    const loraListHtml = (_mpLoras[sub] || []).map((l,i) => _mpLoraEntry(sub, i, l)).join('');
+    const hasAnyComponent = vaeOpts || hasLora;
+    if (hasAnyComponent) {
+      const vaeRow = vaeOpts
+        ? `<div class="model-pick-row">
+            <span class="model-pick-role">VAE</span>
+            <select class="model-pick-sel" id="mpvae-${sub.replace(/-/g,'_')}" onchange="_mpVaeChange('${sub}',this.value)">
+              ${vaeOpts}
+            </select>
+          </div>`
+        : '';
+      const loraSection = hasLora
+        ? `<div class="model-pick-role" style="min-width:0;font-size:10px;color:var(--text-3);margin-top:.15rem">LoRA</div>
+           <div id="mp-loras-${sub}">${loraListHtml}</div>
+           <button class="btn btn-ghost btn-sm" onclick="_mpLoraAdd('${sub}')" style="align-self:flex-start;font-size:11px">+ Add LoRA</button>`
+        : '';
+      vaeLoraSec = `<div class="mp-extra"><details>
+        <summary>VAE / LoRA <em style="font-weight:400;opacity:.65">(optional overrides)</em></summary>
+        <div class="mp-extra-body">
+          ${vaeRow}
+          ${loraSection}
+        </div>
+      </details></div>`;
+    }
+  }
+
+  return `<div class="model-pick-block">
+    <div class="model-pick-title">${isMulti || caps.length > 1 ? 'Models' : 'Model'}</div>
+    ${rows}
+    ${caps.length > 0 ? `<div class="model-pick-hint"><span style="opacity:.6">● loaded in VRAM</span></div>` : ''}
+    ${vaeLoraSec}
+  </div>`;
+}
+
+function _mpSelChange(sub, cap, modelId) {
+  if (!modelId) return;
+  const m = models.find(m => m.id === modelId);
+  if (!m) return;
+  if (cap) assignModelToCap(sub, cap, m);
+  else selectSubModel(sub, m);
+}
+
+function _mpVaeChange(sub, value) { _mpVae[sub] = value || null; }
+
+function _mpLoraChange(sub, i, field, value) {
+  if (!_mpLoras[sub]) _mpLoras[sub] = [];
+  if (!_mpLoras[sub][i]) _mpLoras[sub][i] = {model:'', weight:1.0};
+  _mpLoras[sub][i][field] = (field === 'weight' && !Number.isFinite(value)) ? 1.0 : value;
+}
+
+function _mpLoraAdd(sub) {
+  if (!_mpLoras[sub]) _mpLoras[sub] = [];
+  _mpLoras[sub].push({model:'', weight:1.0});
+  const el = document.getElementById(`mp-loras-${sub}`);
+  if (el) el.innerHTML = _mpLoras[sub].map((l,i) => _mpLoraEntry(sub, i, l)).join('');
+}
+
+function _mpLoraRemove(sub, idx) {
+  if (_mpLoras[sub]) {
+    _mpLoras[sub].splice(idx, 1);
+    const el = document.getElementById(`mp-loras-${sub}`);
+    if (el) el.innerHTML = _mpLoras[sub].map((l,i) => _mpLoraEntry(sub, i, l)).join('');
+  }
+}
+
+function _mpSyncSelect(sub, cap, modelId) {
+  const selId = `mpsel-${sub.replace(/-/g,'_')}-${cap}`;
+  const el = document.getElementById(selId);
+  if (el && modelId) el.value = modelId;
+}
+
+function getVaeForSub(sub) { return _mpVae[sub] || null; }
+function getLorasForSub(sub) {
+  return (_mpLoras[sub] || [])
+    .filter(l => l.model)
+    .map(l => ({
+      model: l.model,
+      weight: l.weight ?? 1.0,
+      name: l.name || l.model.split('/').pop().replace(/[^a-zA-Z0-9_-]/g,'_'),
+    }));
+}
+
+// ── end model pick block ─────────────────────────────────────────────────
+
 function renderCapabilityCard(sub) {
  const shell = $(`cap-${sub}`);
  if (!shell) return;
@@ -2667,9 +2909,9 @@ function renderCapabilityCard(sub) {
      <span class="cap-chip">${details.backendPath}</span>
      <span class="cap-chip">${details.io}</span>
    </div>
+    ${renderModelPickBlock(sub)}
    ${missingBits.join('')}
    ${notes}
-    ${renderSubModelPicker(sub)}
  `;
 }

@@ -2681,16 +2923,11 @@ function renderCapabilityCards() {
    const shell = $(`cap-${sub}`);
    if (!shell) return;
    const state = currentTabState.subs[sub] || 'unavailable';
-    const picker = renderSubModelPicker(sub);
+    const picker = renderModelPickBlock(sub);
    if (state === 'available') {
-      if (picker) {
-        shell.style.display = '';
-        shell.classList.remove('state-partial', 'state-unavailable');
-        shell.innerHTML = picker;
-      } else {
-        shell.style.display = 'none';
-        shell.innerHTML = '';
-      }
+      shell.style.display = '';
+      shell.classList.remove('state-partial', 'state-unavailable');
+      shell.innerHTML = picker;
      return;
    }
    shell.style.display = '';
@@ -2711,8 +2948,8 @@ function renderCapabilityCards() {
        <div class="cap-card-title">${escapeHtml(label)}</div>
        <span class="cap-chip${availabilityClass}">${availabilityLabel}</span>
      </div>
-      ${missingBits.join('')}
      ${picker}
+      ${missingBits.join('')}
    `;
  });
  renderAudioBackendHealth();
@@ -2966,7 +3203,7 @@ function previewExportBody(endpoint, body) {

 function buildAudioPreviewData() {
  return previewExportBody(ROOT_PATH + '/v1/audio/generate', {
-    model: activeModel?.id || '',
+    model: modelForSub('aud-gen') || '',
    prompt: val('ag-prompt'),
    duration: fval('ag-dur') || 10,
    temperature: fval('ag-temp') || 1.0,
@@ -2980,7 +3217,7 @@ function buildAudioPreviewData() {

 function buildTTSPreviewData() {
  return previewExportBody(ROOT_PATH + '/v1/audio/speech', {
-    model: activeModel?.id || '',
+    model: modelForSub('aud-tts') || '',
    input: val('at-text'),
    voice: val('at-voice') || undefined,
    speed: fval('at-speed') || 1.0,
@@ -3000,8 +3237,10 @@ function buildSTTPreviewData() {
 }

 function buildImageGenPreviewData() {
+  const loras = getLorasForSub('img-gen');
+  const vae = getVaeForSub('img-gen');
  return previewExportBody(ROOT_PATH + '/v1/images/generations', {
-    model: activeModel?.id || '',
+    model: modelForSub('img-gen') || '',
    prompt: val('ig-prompt'),
    negative_prompt: val('ig-neg') || undefined,
    size: `${ival('ig-w') || 1024}x${ival('ig-h') || 1024}`,
@@ -3011,6 +3250,8 @@ function buildImageGenPreviewData() {
    n: ival('ig-n') || 1,
    response_format: 'url',
    safety_checker: chk('ig-nosafe') ? false : undefined,
+    ...(vae ? {vae_model: vae} : {}),
+    ...(loras.length ? {loras} : {}),
  });
 }

@@ -3018,7 +3259,7 @@ function buildEmbeddingsPreviewData() {
  const lines = val('em-text').split('\n').filter(l => l.trim());
  const input = lines.length <= 1 ? (lines[0] || '') : lines;
  return previewExportBody(ROOT_PATH + '/v1/embeddings', {
-    model: activeModel?.id || '',
+    model: modelForSub('embed') || '',
    input,
    encoding_format: val('em-enc') || 'float',
    dimensions: val('em-dims') ? ival('em-dims') : undefined,
@@ -3190,6 +3431,8 @@ const REQUEST_PREVIEW_CONFIG = {
      { label:'Steps', value:preview => preview.body.steps },
      { label:'CFG', value:preview => preview.body.guidance_scale },
      { label:'Count', value:preview => preview.body.n },
+      { label:'VAE', value:preview => preview.body.vae_model || '—' },
+      { label:'LoRA', value:preview => preview.body.loras?.length ? preview.body.loras.map(l=>l.model.split('/').pop()).join(', ') : '—' },
    ],
  },
  'embed': {
@@ -3333,6 +3576,7 @@ async function loadModels() {
    models = deduplicateModels(d.data || []);
    renderSidebar();
    if (models.length) selectModel(models[0]);
+    pcPopulateModelSelect(); pePopulateModelSelect();
  } catch(e) {
    $('model-list').innerHTML = '<div class="muted small" style="padding:.5rem .6rem">Failed to load models</div>';
  }
@@ -3467,6 +3711,8 @@ function selectCat(cat) {
  document.querySelectorAll('.t1btn').forEach(b => b.classList.toggle('active', b.dataset.cat === cat));
  const hasL2 = ['image','video','audio','3d','profiles'].includes(cat);
  $('tabbar2').classList.toggle('visible', hasL2);
+  const isChatLike = cat === 'chat' || cat === 'embed';
+  document.querySelector('.sidebar')?.classList.toggle('hidden', !isChatLike);
  if (!hasL2) {
    clearSidebarHighlights();
    document.querySelectorAll('.panel').forEach(p => p.classList.remove('active'));
@@ -3480,7 +3726,7 @@ function selectCat(cat) {
    btn.dataset.catVisible = belongsHere ? cat : '';
    btn.classList.toggle('state-hidden', !belongsHere);
  });
-  if (cat === 'profiles') { profCharLoad(); profEnvLoad(); profVoiceLoad(); }
+  if (cat === 'profiles') { profCharLoad(); profEnvLoad(); profVoiceLoad(); pcPopulateModelSelect(); pePopulateModelSelect(); }
  const activeSub = document.querySelector('.t2btn.active');
  const activeSubFits = activeSub && isSubVisibleForCategory(activeSub.dataset.sub, cat);
  const nextSub = activeSubFits ? activeSub.dataset.sub : getFirstVisibleSub(cat)?.dataset.sub;
@@ -3575,6 +3821,8 @@ function assignModelToCap(sub, cap, model) {
    document.querySelectorAll('.model-item').forEach(el =>
      el.classList.toggle('active', el.dataset.id === model.id));
  }
+  // Keep the panel's <select> in sync with the assigned model (no-op if it is not rendered).
+  if (cap) _mpSyncSelect(sub, cap, model.id);
  renderCapabilityCards();
  if (SUB_CAT[sub]) highlightSidebarForSub(sub);
 }
@@ -4273,6 +4521,8 @@ async function genImage() {
  try {
    const igCharProfiles = getCharProfilesList('ig');
    const igEnvProfiles = getEnvProfilesList('ig');
+    const _igLoras = getLorasForSub('img-gen');
+    const _igVae = getVaeForSub('img-gen');
    const d = await post('/v1/images/generations', {
      model:modelForSub('img-gen'), prompt:val('ig-prompt'),
      size:val('ig-w')+'x'+val('ig-h'),
@@ -4282,6 +4532,8 @@ async function genImage() {
      ...(val('ig-neg') ? {negative_prompt:val('ig-neg')} : {}),
      disable_safety_checker: chk('ig-nosafe'),
      response_format:'url',
+      ...(_igVae ? {vae_model:_igVae} : {}),
+      ...(_igLoras.length ? {loras:_igLoras} : {}),
      ...(igCharProfiles.length ? {character_profiles:igCharProfiles, character_strength:fval('ig-char-str')||0.6} : {}),
      ...(igEnvProfiles.length ? {environment_profiles:igEnvProfiles, environment_strength:fval('ig-env-str')||0.6} : {}),
    });
@@ -5633,12 +5885,12 @@ function profCharSubmit() {
 function pcPopulateModelSelect() {
  const sel = $('pc-gen-model'); if (!sel) return;
  const cur = sel.value;
-  // Collect image-capable models from the cached model list
-  const opts = ['<option value="">Default image model</option>'];
+  const opts = ['<option value="">— select model —</option>'];
  (models || []).forEach(m => {
    const caps = m.capabilities || [];
-    if (caps.includes('image_generation') || caps.includes('image_to_image')) {
-      opts.push(`<option value="${escapeHtml(m.id)}">${escapeHtml(m.id)}</option>`);
+    if (caps.includes('image_generation')) {
+      const lbl = m.id.split('/').pop() + (m.load_mode === 'load' ? ' ●' : '');
+      opts.push(`<option value="${escapeHtml(m.id)}">${escapeHtml(lbl)}</option>`);
    }
  });
  sel.innerHTML = opts.join('');
@@ -5763,11 +6015,43 @@ function renderCharList() {
  }).join('');
 }

+function _openLightbox(src) {
+  const lb = document.createElement('div');
+  lb.className = 'prof-lightbox';
+  lb.innerHTML = `<img src="${escapeHtml(src)}">`;
+  lb.addEventListener('click', () => lb.remove());
+  document.body.appendChild(lb);
+}
+
+function _openProfModal(title, description, images) {
+  const existing = document.getElementById('prof-view-modal');
+  if (existing) existing.remove();
+  const imgHtml = images.length
+    ? images.map((img, i) => `<img src="${escapeHtml(img.data)}" title="${escapeHtml(img.label||`image ${i+1}`)}" data-src="${escapeHtml(img.data)}" onclick="_openLightbox(this.dataset.src)">`).join('')
+    : `<div class="prof-modal-empty">No images stored.</div>`;
+  const backdrop = document.createElement('div');
+  backdrop.id = 'prof-view-modal';
+  backdrop.className = 'prof-modal-backdrop';
+  backdrop.innerHTML = `
+    <div class="prof-modal">
+      <div class="prof-modal-hd">
+        <h3>${escapeHtml(title)}</h3>
+        <button class="btn btn-ghost btn-sm" onclick="document.getElementById('prof-view-modal').remove()">✕ Close</button>
+      </div>
+      <div class="prof-modal-body">
+        ${description ? `<div class="prof-modal-desc">${escapeHtml(description)}</div>` : ''}
+        <div class="prof-modal-imgs">${imgHtml}</div>
+      </div>
+    </div>`;
+  backdrop.addEventListener('click', e => { if (e.target === backdrop) backdrop.remove(); });
+  document.body.appendChild(backdrop);
+}
+
 async function profCharView(name) {
-  const d = await fetch(ROOT_PATH + '/admin/api/characters/'+encodeURIComponent(name)).then(r=>r.json());
-  const imgs = (d.images||[]).map(img=>`<img src="${img.data}" style="height:80px;border-radius:4px;object-fit:cover" title="${escapeHtml(img.label||'')}">`).join('');
-  alert(`Character: ${d.name}\nDescription: ${d.description||'—'}\nImages: ${d.image_count}\n\n(Images are shown in console; open DevTools to inspect)`);
-  console.log('[profCharView]', d.name, d);
+  try {
+    const d = await fetch(ROOT_PATH + '/admin/api/characters/'+encodeURIComponent(name)).then(r=>r.json());
+    _openProfModal(`Character: ${d.name}`, d.description||'', d.images||[]);
+  } catch(e) { alert('Failed to load character: ' + e.message); }
 }

 async function profCharDelete(name) {
@@ -5892,11 +6176,12 @@ function profEnvSubmit() {
 function pePopulateModelSelect() {
  const sel = $('pe-gen-model'); if (!sel) return;
  const cur = sel.value;
-  const opts = ['<option value="">Default image model</option>'];
+  const opts = ['<option value="">— select model —</option>'];
  (models || []).forEach(m => {
    const caps = m.capabilities || [];
-    if (caps.includes('image_generation') || caps.includes('image_to_image')) {
-      opts.push(`<option value="${escapeHtml(m.id)}">${escapeHtml(m.id)}</option>`);
+    if (caps.includes('image_generation')) {
+      const lbl = m.id.split('/').pop() + (m.load_mode === 'load' ? ' ●' : '');
+      opts.push(`<option value="${escapeHtml(m.id)}">${escapeHtml(lbl)}</option>`);
    }
  });
  sel.innerHTML = opts.join('');
@@ -6008,9 +6293,10 @@ function renderEnvList() {
 }

 async function profEnvView(name) {
-  const d = await fetch(ROOT_PATH + '/admin/api/environments/'+encodeURIComponent(name)).then(r=>r.json());
-  alert(`Environment: ${d.name}\nDescription: ${d.description||'—'}\nImages: ${d.image_count}\n\n(Images are shown in console; open DevTools to inspect)`);
-  console.log('[profEnvView]', d.name, d);
+  try {
+    const d = await fetch(ROOT_PATH + '/admin/api/environments/'+encodeURIComponent(name)).then(r=>r.json());
+    _openProfModal(`Environment: ${d.name}`, d.description||'', d.images||[]);
+  } catch(e) { alert('Failed to load environment: ' + e.message); }
 }

 async function profEnvDelete(name) {

--- a/codai/admin/templates/settings.html
+++ b/codai/admin/templates/settings.html
@@ -102,6 +102,57 @@
  </div>
 </div>

+<!-- Thermal protection -->
+<div class="card mb-0" style="margin-top:1rem">
+  <div class="card-title">Thermal Protection</div>
+  <span class="form-hint" style="display:block;margin-bottom:.75rem">
+    Before serving a request against a loaded model, wait until temperatures are
+    safe so a long sequence of heavy generations can't overheat the machine and
+    trip its power-off protection. The wait is non-blocking (other requests keep
+    being accepted) and takes effect immediately on save. Temperatures in °C.
+  </span>
+
+  <div class="form-row">
+    <label style="display:flex;align-items:center;gap:.5rem;cursor:pointer">
+      <input type="checkbox" id="s-therm-gpu-enabled" onchange="toggleThermalFields()">
+      <span style="font-size:13px;font-weight:500">Enable GPU temperature protection</span>
+    </label>
+  </div>
+  <div id="therm-gpu-fields" class="form-row" style="display:grid;grid-template-columns:1fr 1fr;gap:1rem">
+    <div>
+      <label class="form-label">Pause when GPU reaches (°C)</label>
+      <input type="number" id="s-therm-gpu-high" class="form-input" min="40" max="120" step="1" placeholder="90">
+    </div>
+    <div>
+      <label class="form-label">Resume when GPU drops to (°C)</label>
+      <input type="number" id="s-therm-gpu-resume" class="form-input" min="30" max="120" step="1" placeholder="87">
+    </div>
+  </div>
+
+  <div class="form-row" style="margin-top:.5rem">
+    <label style="display:flex;align-items:center;gap:.5rem;cursor:pointer">
+      <input type="checkbox" id="s-therm-cpu-enabled" onchange="toggleThermalFields()">
+      <span style="font-size:13px;font-weight:500">Enable CPU temperature protection</span>
+    </label>
+  </div>
+  <div id="therm-cpu-fields" class="form-row" style="display:grid;grid-template-columns:1fr 1fr;gap:1rem">
+    <div>
+      <label class="form-label">Pause when CPU reaches (°C)</label>
+      <input type="number" id="s-therm-cpu-high" class="form-input" min="40" max="120" step="1" placeholder="90">
+    </div>
+    <div>
+      <label class="form-label">Resume when CPU drops to (°C)</label>
+      <input type="number" id="s-therm-cpu-resume" class="form-input" min="30" max="120" step="1" placeholder="87">
+    </div>
+  </div>
+
+  <div class="form-row" style="margin:0">
+    <label class="form-label">Re-check interval while cooling down (seconds)</label>
+    <input type="number" id="s-therm-poll" class="form-input" style="max-width:200px" min="1" max="120" step="1" placeholder="5">
+    <span class="form-hint">How often to re-read temperatures while waiting for cooldown.</span>
+  </div>
+</div>
+
 <div class="card mb-0" style="margin-top:1rem">
  <div class="card-title">AISBF Broker</div>
  <div class="form-row">
@@ -210,6 +261,13 @@ function toggleBrokerFields(){
  }
 }

+function toggleThermalFields(){
+  document.getElementById('therm-gpu-fields').style.display =
+    document.getElementById('s-therm-gpu-enabled').checked ? 'grid' : 'none';
+  document.getElementById('therm-cpu-fields').style.display =
+    document.getElementById('s-therm-cpu-enabled').checked ? 'grid' : 'none';
+}
+
 function showAlert(type, msg){
  const el = document.getElementById('settings-alert');
  el.className = 'alert alert-' + (type === 'error' ? 'error' : 'info');
@@ -260,6 +318,16 @@ async function loadSettings(){
    document.getElementById('s-broker-reconnect-max').value = broker.reconnect_max_delay_seconds ?? 60;
    document.getElementById('s-broker-ws-ping').value = broker.websocket_ping_interval ?? 20;
    toggleBrokerFields();
+    // Thermal protection
+    const therm = d.thermal || {};
+    document.getElementById('s-therm-gpu-enabled').checked = therm.gpu_enabled !== false;
+    document.getElementById('s-therm-cpu-enabled').checked = therm.cpu_enabled !== false;
+    document.getElementById('s-therm-gpu-high').value = therm.gpu_high ?? 90;
+    document.getElementById('s-therm-gpu-resume').value = therm.gpu_resume ?? 87;
+    document.getElementById('s-therm-cpu-high').value = therm.cpu_high ?? 90;
+    document.getElementById('s-therm-cpu-resume').value = therm.cpu_resume ?? 87;
+    document.getElementById('s-therm-poll').value = therm.poll_seconds ?? 5;
+    toggleThermalFields();
  }catch(e){ showAlert('error','Failed to load settings: '+e.message); }
 }

@@ -286,6 +354,15 @@ async function saveSettings(){
      directory: document.getElementById('s-arc-dir').value.trim(),
      retention: document.getElementById('s-arc-retention').value,
    },
+    thermal:{
+      gpu_enabled: document.getElementById('s-therm-gpu-enabled').checked,
+      cpu_enabled: document.getElementById('s-therm-cpu-enabled').checked,
+      gpu_high:   parseFloat(document.getElementById('s-therm-gpu-high').value)   || 90,
+      gpu_resume: parseFloat(document.getElementById('s-therm-gpu-resume').value) || 87,
+      cpu_high:   parseFloat(document.getElementById('s-therm-cpu-high').value)   || 90,
+      cpu_resume: parseFloat(document.getElementById('s-therm-cpu-resume').value) || 87,
+      poll_seconds: parseFloat(document.getElementById('s-therm-poll').value) || 5,
+    },
    broker:{
      enabled: document.getElementById('s-broker-enabled').checked,
      base_url: document.getElementById('s-broker-base-url').value.trim(),
@@ -310,7 +387,7 @@ async function saveSettings(){
      method:'POST', headers:{'Content-Type':'application/json'},
      body: JSON.stringify(data)
    });
-    if(r.ok) showAlert('info','Settings saved. Archive changes take effect immediately; restart CoderAI for other changes.');
+    if(r.ok) showAlert('info','Settings saved. Archive and thermal-protection changes take effect immediately; restart CoderAI for other changes.');
    else{ const e=await r.json(); showAlert('error', e.detail||'Save failed'); }
  }catch(e){ showAlert('error','Error: '+e.message); }
 }
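
Note on the thermal settings above: the separate "pause" and "resume" thresholds imply a hysteresis loop, where a request that was paused at the high threshold only proceeds once the temperature has dropped to the lower one. The server-side code is not part of this template, so the following is only a minimal sketch of that idea. The function name `wait_until_cool`, the `read_temp` callback, and the injectable `sleep` parameter are hypothetical; the defaults mirror the placeholders in the form (90, 87, 5).

```python
import asyncio
from typing import Awaitable, Callable


async def wait_until_cool(
    read_temp: Callable[[], float],
    high: float = 90.0,
    resume: float = 87.0,
    poll_seconds: float = 5.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> int:
    """Wait while the sensor is too hot; return how many polls were needed.

    Hypothetical helper: pauses only if the first reading is at or above
    `high`, then keeps waiting until a reading is at or below `resume`.
    """
    polls = 0
    if read_temp() < high:
        return polls  # cool enough, serve the request immediately
    while True:
        # Awaiting the sleep yields to the event loop, so other requests
        # can still be accepted while this one is waiting.
        await sleep(poll_seconds)
        polls += 1
        if read_temp() <= resume:
            return polls
```

With readings of 91, 89, 88, 87 the call waits three polls: 89 and 88 are already below the pause threshold, but the request stays paused until the resume threshold is reached.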

--- a/codai/api/app.py
+++ b/codai/api/app.py
@@ -139,6 +139,7 @@ from codai.api.voice_clone import router as voice_clone_router
 from codai.api.voice_convert import router as voice_convert_router
 from codai.api.faceswap import router as faceswap_router
 from codai.api.characters import router as characters_router
+from codai.api.loras import router as loras_router
 from codai.api.spatial import router as spatial_router
 from codai.api.environments import router as environments_router
 from codai.admin.routes import router as admin_router
@@ -203,6 +204,7 @@ app.include_router(voice_clone_router)
 app.include_router(voice_convert_router)
 app.include_router(faceswap_router)
 app.include_router(characters_router)
+app.include_router(loras_router)
 app.include_router(environments_router)
 app.include_router(spatial_router)
 app.include_router(admin_router)

--- a/codai/api/audio_gen.py
+++ b/codai/api/audio_gen.py
@@ -119,11 +119,35 @@ def _load_musicgen(model_name: str, device: str):
    return model


-def _load_audioldm(model_name: str, device: str):
+def _load_audioldm(model_name: str, device: str, model_config: dict = None):
    import torch
    from diffusers import AudioLDM2Pipeline
-    pipe = AudioLDM2Pipeline.from_pretrained(model_name, torch_dtype=torch.float16)
-    pipe = pipe.to(device)
+    from codai.models.hf_loading import resolve_dtype
+    dtype = resolve_dtype(model_config, default='f16')
+    _xtra = {}
+    # Apply 4-bit/8-bit quantization to the diffusion backbone when configured.
+    _mc = model_config or {}
+    if _mc.get('load_in_4bit') or _mc.get('load_in_8bit'):
+        _bits = 4 if _mc.get('load_in_4bit') else 8
+        try:
+            from diffusers.quantizers import PipelineQuantizationConfig
+            _qk = ({'load_in_4bit': True, 'bnb_4bit_compute_dtype': dtype}
+                   if _mc.get('load_in_4bit') else {'load_in_8bit': True})
+            _xtra['quantization_config'] = PipelineQuantizationConfig(
+                quant_backend=f"bitsandbytes_{_bits}bit",
+                quant_kwargs=_qk,
+                components_to_quantize=["transformer", "unet"],
+            )
+            print(f"AudioLDM quantization: {_bits}-bit (bitsandbytes)")
+        except Exception as e:
+            print(f"AudioLDM quantization unavailable: {e}")
+    pipe = AudioLDM2Pipeline.from_pretrained(model_name, torch_dtype=dtype, **_xtra)
+    # CPU offload when configured; otherwise place on device (skip for quantized).
+    _off = _mc.get('offload_strategy')
+    if _off in ('cpu', 'sequential', 'model', 'disk') and hasattr(pipe, 'enable_model_cpu_offload'):
+        pipe.enable_model_cpu_offload()
+    elif 'quantization_config' not in _xtra:
+        pipe = pipe.to(device)
    return pipe


@@ -224,7 +248,8 @@ async def audio_generate(request: AudioGenerationRequest, http_request: Request
    Compatible models: MusicGen, AudioGen, AudioLDM2, StableAudio.
    """
    _aud_progress_loading(request.model or "audio")
-    model_info = multi_model_manager.request_model(request.model, model_type="audio_gen")
+    model_info = await asyncio.to_thread(
+        multi_model_manager.request_model, request.model, model_type="audio_gen")
    model_name = model_info.get('model_name')
    if not model_name:
        err = model_info.get('error', f"Model '{request.model}' not found")
@@ -236,13 +261,14 @@ async def audio_generate(request: AudioGenerationRequest, http_request: Request
    if pipe is None:
        device = _derive_device()
        model_type = _detect_audio_gen_type(model_name)
+        _ag_cfg = model_info.get('config') or {}
        try:
            if model_type in ('musicgen', 'audiogen'):
                pipe = await asyncio.get_event_loop().run_in_executor(
                    None, _load_musicgen, model_name, device)
            else:
                pipe = await asyncio.get_event_loop().run_in_executor(
-                    None, _load_audioldm, model_name, device)
+                    None, _load_audioldm, model_name, device, _ag_cfg)
        except Exception as e:
            raise HTTPException(status_code=500, detail=f"Failed to load audio gen model: {e}")
        multi_model_manager.models[model_key] = pipe

--- a/codai/api/characters.py
+++ b/codai/api/characters.py
@@ -37,9 +37,38 @@ import tempfile
 import time
 from typing import List, Optional

-from fastapi import APIRouter, HTTPException, Request
+from fastapi import APIRouter, Depends, HTTPException, Request
 from pydantic import BaseModel, ConfigDict

+
+def _require_api_auth(request: Request) -> None:
+    """Raise 401 if auth is enabled and the request carries no valid credential."""
+    try:
+        from codai.admin import routes as _admin_routes
+        sm = _admin_routes.session_manager
+    except Exception:
+        return  # auth subsystem unavailable — allow through
+    if sm is None:
+        return  # auth not configured on this instance
+
+    auth = request.headers.get("authorization", "")
+    if auth.lower().startswith("bearer "):
+        token = auth[7:].strip()
+        if sm.verify_token(token):
+            return
+
+    cookie = request.cookies.get("session", "")
+    if cookie.endswith(".MUST_CHANGE"):
+        cookie = cookie[:-12]
+    if cookie and sm.validate_session(cookie):
+        return
+
+    raise HTTPException(
+        status_code=401,
+        detail={"message": "Invalid API key. Provide a valid Bearer token.",
+                "type": "invalid_request_error", "code": "invalid_api_key"},
+    )
+
 from codai.platform_paths import default_characters_dir, legacy_style_config_dir

 router = APIRouter()
@@ -211,7 +240,12 @@ def _decode_source(data: str) -> bytes:


 def _detect_faces_cv2(img_bytes: bytes):
-    """Return list of (x,y,w,h) face rects using Haar cascade, or [] if cv2 unavailable."""
+    """
+    Return list of (x,y,w,h) face rects, largest first.
+    Tries MediaPipe first (most accurate), then falls back to a Haar cascade.
+    Detections smaller than 2% of image area are discarded as false positives.
+    Returns [] if no library is available or no plausible face is found.
+    """
    try:
        import cv2
        import numpy as np
@@ -219,19 +253,56 @@ def _detect_faces_cv2(img_bytes: bytes):
        img = cv2.imdecode(arr, cv2.IMREAD_COLOR)
        if img is None:
            return []
+        ih, iw = img.shape[:2]
+        img_area = ih * iw
+        min_face_area = img_area * 0.02  # reject anything < 2% of image
+
+        # ── Try MediaPipe first (most accurate, no model download needed) ──
+        try:
+            import mediapipe as mp
+            mp_face = mp.solutions.face_detection
+            with mp_face.FaceDetection(model_selection=1, min_detection_confidence=0.5) as det:
+                rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
+                results = det.process(rgb)
+            if results.detections:
+                rects = []
+                for d in results.detections:
+                    bb = d.location_data.relative_bounding_box
+                    x = int(bb.xmin * iw)
+                    y = int(bb.ymin * ih)
+                    w = int(bb.width * iw)
+                    h = int(bb.height * ih)
+                    if w * h >= min_face_area:
+                        rects.append((x, y, w, h))
+                if rects:
+                    rects.sort(key=lambda r: r[2]*r[3], reverse=True)
+                    return rects
+        except ImportError:
+            pass
+
+        # ── Haar cascade fallback (stricter parameters to reduce false positives) ──
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
+        gray = cv2.equalizeHist(gray)
        cascade_path = cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
        cascade = cv2.CascadeClassifier(cascade_path)
-        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(40, 40))
+        # minSize scaled to image: at least 8% of the shorter dimension
+        min_dim = int(min(iw, ih) * 0.08)
+        faces = cascade.detectMultiScale(
+            gray, scaleFactor=1.05, minNeighbors=8,
+            minSize=(max(40, min_dim), max(40, min_dim)),
+        )
        if len(faces) == 0:
            return []
-        return [(int(x), int(y), int(w), int(h)) for x, y, w, h in faces]
+        rects = [(int(x), int(y), int(w), int(h)) for x, y, w, h in faces
+                 if int(w) * int(h) >= min_face_area]
+        rects.sort(key=lambda r: r[2]*r[3], reverse=True)
+        return rects
    except Exception:
        return []


 def _crop_face(img_bytes: bytes, rect) -> Optional[bytes]:
-    """Crop a face rect (with padding) from an image, return PNG bytes."""
+    """Crop a face rect with generous padding (head-and-shoulders), return PNG bytes."""
    try:
        import cv2
        import numpy as np
@@ -241,11 +312,15 @@ def _crop_face(img_bytes: bytes, rect) -> Optional[bytes]:
        if img is None:
            return None
        ih, iw = img.shape[:2]
-        pad = int(max(w, h) * 0.4)
-        x1 = max(0, x - pad)
-        y1 = max(0, y - pad)
-        x2 = min(iw, x + w + pad)
-        y2 = min(ih, y + h + pad)
+        side = max(w, h)
+        # More padding on top to include hair/forehead, less at bottom
+        pad_sides = int(side * 0.5)
+        pad_top   = int(side * 0.7)
+        pad_bot   = int(side * 0.4)
+        x1 = max(0, x - pad_sides)
+        y1 = max(0, y - pad_top)
+        x2 = min(iw, x + w + pad_sides)
+        y2 = min(ih, y + h + pad_bot)
        crop = img[y1:y2, x1:x2]
        ok, buf = cv2.imencode('.png', crop)
        return bytes(buf) if ok else None
@@ -274,7 +349,7 @@ def _extract_from_image(img_bytes: bytes) -> List[bytes]:
        crops = [c for f in faces for c in [_crop_face(img_bytes, f)] if c]
        if crops:
            return crops
-    # No face detected — use whole image as reference
+    # No face detected (or all detections filtered as false positives) — use whole image
    try:
        from PIL import Image as PILImage
        img = PILImage.open(io.BytesIO(img_bytes)).convert('RGB')
@@ -345,7 +420,7 @@ def resolve_character_profiles(profile_names: List[str]) -> List[str]:
 # ── Endpoints ─────────────────────────────────────────────────────────────────

 @router.post("/v1/characters")
-async def save_character(req: CharacterSaveRequest):
+async def save_character(req: CharacterSaveRequest, _auth=Depends(_require_api_auth)):
    """Save or update a named character profile."""
    if not req.name or '/' in req.name or '..' in req.name:
        raise HTTPException(status_code=400, detail="Invalid character name")
@@ -356,13 +431,13 @@ async def save_character(req: CharacterSaveRequest):


 @router.get("/v1/characters")
-async def list_characters():
+async def list_characters(_auth=Depends(_require_api_auth)):
    """List all saved character profiles (metadata only, no images)."""
    return {"characters": _list_characters()}


 @router.get("/v1/characters/{name}")
-async def get_character(name: str):
+async def get_character(name: str, _auth=Depends(_require_api_auth)):
    """Get a character profile including its reference images as base64."""
    meta = _load_character_meta(name)
    if not meta:
@@ -378,7 +453,7 @@ async def get_character(name: str):


 @router.delete("/v1/characters/{name}")
-async def delete_character(name: str):
+async def delete_character(name: str, _auth=Depends(_require_api_auth)):
    """Delete a character profile."""
    cdir = _char_dir(name)
    if not os.path.isdir(cdir):
@@ -389,7 +464,7 @@ async def delete_character(name: str):


 @router.patch("/v1/characters/{name}")
-async def patch_character(name: str, req: CharacterPatchRequest):
+async def patch_character(name: str, req: CharacterPatchRequest, _auth=Depends(_require_api_auth)):
    """Update a character profile: description, add images, or remove images by index."""
    meta = _load_character_meta(name)
    if not meta:
@@ -462,29 +537,24 @@ async def generate_character(req: CharacterGenerateRequest, request: Request):
    if req.steps:
        payload["steps"] = req.steps

-    # Forward the caller's auth token so rate-limit / auth middleware passes
-    auth_header = request.headers.get("authorization", "")
-    headers = {"Content-Type": "application/json"}
-    if auth_header:
-        headers["Authorization"] = auth_header
-
    try:
-        from httpx import AsyncClient, ASGITransport
-        async with AsyncClient(
-            transport=ASGITransport(app=request.app),
-            base_url="http://internal",
-            timeout=300,
-        ) as client:
-            r = await client.post("/v1/images/generations", json=payload, headers=headers)
-
-        if not r.is_success:
+        import json as _json
+        from codai.broker.asgi_bridge import execute_internal_request
+        resp = await execute_internal_request(
+            request.app,
+            method="POST",
+            path="/v1/images/generations",
+            headers={"Content-Type": "application/json"},
+            body=_json.dumps(payload).encode(),
+        )
+        if resp["status_code"] >= 400:
            try:
-                detail = r.json().get("detail", r.text)
+                detail = _json.loads(resp["body"]).get("detail", resp["body"].decode())
            except Exception:
-                detail = r.text
-            raise HTTPException(status_code=r.status_code, detail=f"Image generation failed: {detail}")
+                detail = resp["body"].decode()
+            raise HTTPException(status_code=resp["status_code"], detail=f"Image generation failed: {detail}")

-        images_data = r.json().get("data", [])
+        images_data = _json.loads(resp["body"]).get("data", [])
    except HTTPException:
        raise
    except Exception as e:
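
Note on the face-cropping changes in this file: the new asymmetric padding and the 2% area filter are plain arithmetic, so they can be checked without OpenCV. The sketch below restates them as two pure functions. The names `padded_face_box` and `is_plausible_face` are hypothetical and do not exist in `characters.py`; the factors (0.5, 0.7, 0.4, 0.02) are copied from the diff.

```python
from typing import Tuple

Rect = Tuple[int, int, int, int]  # (x, y, w, h)


def padded_face_box(rect: Rect, image_width: int, image_height: int) -> Tuple[int, int, int, int]:
    """Return (x1, y1, x2, y2) for a head-and-shoulders crop, clamped to the image."""
    x, y, w, h = rect
    side = max(w, h)
    pad_sides = int(side * 0.5)
    pad_top = int(side * 0.7)   # extra room for hair and forehead
    pad_bot = int(side * 0.4)
    x1 = max(0, x - pad_sides)
    y1 = max(0, y - pad_top)
    x2 = min(image_width, x + w + pad_sides)
    y2 = min(image_height, y + h + pad_bot)
    return x1, y1, x2, y2


def is_plausible_face(rect: Rect, image_width: int, image_height: int,
                      min_fraction: float = 0.02) -> bool:
    """Reject detections covering less than `min_fraction` of the image area."""
    _, _, w, h = rect
    return w * h >= image_width * image_height * min_fraction
```

For a 50x50 detection at (100, 100) in a 400x400 image, `padded_face_box` gives `(75, 65, 175, 170)`; near an image edge the box is clamped rather than shifted, so such crops can come out smaller than the nominal padding suggests.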

--- a/codai/api/embeddings.py
+++ b/codai/api/embeddings.py
@@ -48,10 +48,16 @@ def _derive_device() -> str:
    return "cuda:0"


-def _load_embedding_model(model_name: str, device: str):
+def _load_embedding_model(model_name: str, device: str, model_config: dict = None):
+    from codai.models.hf_loading import build_from_pretrained_kwargs
    try:
        from sentence_transformers import SentenceTransformer
-        model = SentenceTransformer(model_name, device=device)
+        # sentence-transformers honours quantization via model_kwargs.
+        fp = build_from_pretrained_kwargs(model_config)
+        st_kwargs = {}
+        if 'quantization_config' in fp:
+            st_kwargs['model_kwargs'] = {'quantization_config': fp['quantization_config']}
+        model = SentenceTransformer(model_name, device=device, **st_kwargs)
        return ('sentence_transformers', model)
    except ImportError:
        pass
@@ -59,8 +65,11 @@ def _load_embedding_model(model_name: str, device: str):
    try:
        from transformers import AutoTokenizer, AutoModel
        import torch
+        fp = build_from_pretrained_kwargs(model_config)
        tokenizer = AutoTokenizer.from_pretrained(model_name)
-        model = AutoModel.from_pretrained(model_name).to(device)
+        model = AutoModel.from_pretrained(model_name, **fp)
+        if 'quantization_config' not in fp and 'device_map' not in fp:
+            model = model.to(device)
        return ('transformers', (tokenizer, model, device))
    except Exception as e:
        raise RuntimeError(f"Cannot load embedding model '{model_name}': {e}")
@@ -97,7 +106,8 @@ async def create_embeddings(request: EmbeddingsRequest, http_request: Request =
    """
    OpenAI-compatible embeddings endpoint.
    """
-    model_info = multi_model_manager.request_model(request.model, model_type="embedding")
+    model_info = await asyncio.to_thread(
+        multi_model_manager.request_model, request.model, model_type="embedding")
    model_name = model_info.get('model_name')
    if not model_name:
        err = model_info.get('error', f"Model '{request.model}' not found")
@@ -108,9 +118,11 @@ async def create_embeddings(request: EmbeddingsRequest, http_request: Request =

    if model_obj is None:
        device = _derive_device()
+        _emb_cfg = (multi_model_manager.config.get(f"embedding:{model_name}")
+                    or multi_model_manager.config.get(model_name) or {})
        try:
            model_obj = await asyncio.get_event_loop().run_in_executor(
-                None, _load_embedding_model, model_name, device)
+                None, _load_embedding_model, model_name, device, _emb_cfg)
        except Exception as e:
            raise HTTPException(status_code=500, detail=f"Failed to load embedding model: {e}")
        multi_model_manager.models[model_key] = model_obj

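Note on the embeddings hunks above: the sketch below isolates the kwarg routing they introduce. It assumes `build_from_pretrained_kwargs` returns a plain dict that may contain `quantization_config` and/or `device_map`; the helper's real return shape is not shown in this diff. `plan_embedding_load` is a hypothetical name used only for illustration.

```python
from typing import Any, Dict, Optional, Tuple


def plan_embedding_load(fp: Optional[Dict[str, Any]]) -> Tuple[Dict[str, Any], bool]:
    """Return (extra SentenceTransformer kwargs, whether .to(device) is still needed).

    `fp` stands in for the dict produced by the loader-kwargs helper (assumed shape).
    """
    fp = fp or {}

    # sentence-transformers forwards `model_kwargs` to the underlying
    # transformers model, so the quantization config is nested there.
    st_kwargs: Dict[str, Any] = {}
    if "quantization_config" in fp:
        st_kwargs["model_kwargs"] = {"quantization_config": fp["quantization_config"]}

    # Quantized or device_map-dispatched models are placed at load time,
    # so an explicit move is only needed when neither key is present.
    needs_manual_move = "quantization_config" not in fp and "device_map" not in fp
    return st_kwargs, needs_manual_move
```

With an empty config the plan is "no extra kwargs, move manually", which matches the pre-change behaviour of calling `.to(device)` unconditionally.
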
--- a/codai/api/environments.py
+++ b/codai/api/environments.py
@@ -39,7 +39,32 @@ import tempfile
 import time
 from typing import List, Optional

-from fastapi import APIRouter, HTTPException, Request
+from fastapi import APIRouter, Depends, HTTPException, Request
+
+
+def _require_api_auth(request: Request) -> None:
+    """Raise 401 if auth is enabled and the request carries no valid credential."""
+    try:
+        from codai.admin import routes as _admin_routes
+        sm = _admin_routes.session_manager
+    except Exception:
+        return
+    if sm is None:
+        return
+    auth = request.headers.get("authorization", "")
+    if auth.lower().startswith("bearer "):
+        if sm.verify_token(auth[7:].strip()):
+            return
+    cookie = request.cookies.get("session", "")
+    if cookie.endswith(".MUST_CHANGE"):
+        cookie = cookie[:-12]
+    if cookie and sm.validate_session(cookie):
+        return
+    raise HTTPException(
+        status_code=401,
+        detail={"message": "Invalid API key. Provide a valid Bearer token.",
+                "type": "invalid_request_error", "code": "invalid_api_key"},
+    )
 from pydantic import BaseModel, ConfigDict

 from codai.platform_paths import default_environments_dir, legacy_style_config_dir
@@ -283,7 +308,7 @@ def resolve_environment_profiles(profile_names: List[str]) -> List[str]:
 # ── Endpoints ─────────────────────────────────────────────────────────────────

 @router.post("/v1/environments")
-async def save_environment(req: EnvironmentSaveRequest):
+async def save_environment(req: EnvironmentSaveRequest, _auth=Depends(_require_api_auth)):
    """Save or update a named environment profile."""
    if not req.name or '/' in req.name or '..' in req.name:
        raise HTTPException(status_code=400, detail="Invalid environment name")
@@ -294,13 +319,13 @@ async def save_environment(req: EnvironmentSaveRequest):


 @router.get("/v1/environments")
-async def list_environments():
+async def list_environments(_auth=Depends(_require_api_auth)):
    """List all saved environment profiles (metadata only)."""
    return {"environments": _list_environments()}


 @router.get("/v1/environments/{name}")
-async def get_environment(name: str):
+async def get_environment(name: str, _auth=Depends(_require_api_auth)):
    """Get an environment profile including its reference images as base64."""
    meta = _load_environment_meta(name)
    if not meta:
@@ -316,7 +341,7 @@ async def get_environment(name: str):


 @router.delete("/v1/environments/{name}")
-async def delete_environment(name: str):
+async def delete_environment(name: str, _auth=Depends(_require_api_auth)):
    """Delete an environment profile."""
    edir = _env_dir(name)
    if not os.path.isdir(edir):
@@ -327,7 +352,7 @@ async def delete_environment(name: str):


 @router.patch("/v1/environments/{name}")
-async def patch_environment(name: str, req: EnvironmentPatchRequest):
+async def patch_environment(name: str, req: EnvironmentPatchRequest, _auth=Depends(_require_api_auth)):
    """Update an environment profile: description, add images, or remove images by index."""
    meta = _load_environment_meta(name)
    if not meta:
@@ -398,28 +423,24 @@ async def generate_environment(req: EnvironmentGenerateRequest, request: Request
    if req.steps:
        payload["steps"] = req.steps

-    auth_header = request.headers.get("authorization", "")
-    headers = {"Content-Type": "application/json"}
-    if auth_header:
-        headers["Authorization"] = auth_header
-
    try:
-        from httpx import AsyncClient, ASGITransport
-        async with AsyncClient(
-            transport=ASGITransport(app=request.app),
-            base_url="http://internal",
-            timeout=300,
-        ) as client:
-            r = await client.post("/v1/images/generations", json=payload, headers=headers)
-
-        if not r.is_success:
+        import json as _json
+        from codai.broker.asgi_bridge import execute_internal_request
+        resp = await execute_internal_request(
+            request.app,
+            method="POST",
+            path="/v1/images/generations",
+            headers={"Content-Type": "application/json"},
+            body=_json.dumps(payload).encode(),
+        )
+        if resp["status_code"] >= 400:
            try:
-                detail = r.json().get("detail", r.text)
+                detail = _json.loads(resp["body"]).get("detail", resp["body"].decode())
            except Exception:
-                detail = r.text
-            raise HTTPException(status_code=r.status_code, detail=f"Image generation failed: {detail}")
+                detail = resp["body"].decode()
+            raise HTTPException(status_code=resp["status_code"], detail=f"Image generation failed: {detail}")

-        images_data = r.json().get("data", [])
+        images_data = _json.loads(resp["body"]).get("data", [])
    except HTTPException:
        raise
    except Exception as e:

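Note on `_require_api_auth` above: a framework-free sketch of the same decision order (auth disabled, then Bearer token, then session cookie with the `.MUST_CHANGE` suffix stripped). `FakeSessionManager` and `is_authorized` are hypothetical stand-ins; the real session manager lives in `codai.admin.routes` and is not shown in this diff.

```python
from typing import Iterable, Mapping, Optional

MUST_CHANGE_SUFFIX = ".MUST_CHANGE"


class FakeSessionManager:
    """Hypothetical stand-in exposing the two methods the diff calls."""

    def __init__(self, tokens: Iterable[str], sessions: Iterable[str]) -> None:
        self._tokens = set(tokens)
        self._sessions = set(sessions)

    def verify_token(self, token: str) -> bool:
        return token in self._tokens

    def validate_session(self, session_id: str) -> bool:
        return session_id in self._sessions


def is_authorized(headers: Mapping[str, str], cookies: Mapping[str, str],
                  sm: Optional[FakeSessionManager]) -> bool:
    # No session manager configured means auth is disabled.
    if sm is None:
        return True
    auth = headers.get("authorization", "")
    if auth.lower().startswith("bearer ") and sm.verify_token(auth[7:].strip()):
        return True
    cookie = cookies.get("session", "")
    # Slice by the suffix length instead of a literal 12 to keep them in sync.
    if cookie.endswith(MUST_CHANGE_SUFFIX):
        cookie = cookie[: -len(MUST_CHANGE_SUFFIX)]
    return bool(cookie) and sm.validate_session(cookie)
```

In the endpoint code a `False` result corresponds to the `HTTPException(status_code=401, ...)` branch.
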
--- a/codai/api/text.py
+++ b/codai/api/text.py
@@ -283,25 +283,69 @@ async def chat_completions(request: ChatCompletionRequest, http_request: Request
    # Continue with original implementation for 'auto' parser
    # Get the model for this request
    requested_model = request.model
-    
-    # Use the manager to resolve the model and manage VRAM (handles ondemand unloading)
-    model_info = multi_model_manager.request_model(
-        requested_model=requested_model,
-        model_type="text"
-    )
-    
-    # Check if the model was rejected as not allowed
-    if model_info.get('error'):
-        raise HTTPException(status_code=404, detail=model_info['error'])
-    
-    # Acquire the least-busy instance (increments ref-count; released on response completion)
-    _model_key = model_info.get('model_key')
+
+    # Resolve and load the model, waiting if another model is currently loading.
+    # Retries up to ~5 minutes (60 × 5s) so requests queue behind long video loads
+    # rather than failing immediately with "No model loaded".
+    _MAX_WAIT_TRIES = 60
+    _model_key = None
    _instance_idx = None
-    _acq = multi_model_manager.acquire_model_instance(_model_key) if _model_key else None
-    if _acq:
-        _instance_idx, mm = _acq
-    else:
-        mm = multi_model_manager.get_model_for_request(requested_model)
+    mm = None
+    model_info = {}
+
+    for _attempt in range(_MAX_WAIT_TRIES):
+        # Fail fast on a corrupted CUDA context — retrying 60× is pointless.
+        if getattr(multi_model_manager, 'cuda_context_poisoned', False):
+            raise HTTPException(status_code=503, detail=(
+                "CUDA context corrupted by an earlier device-side assert "
+                f"({multi_model_manager.cuda_poison_reason}). Restart coderai to recover."))
+
+        # If another model is loading, yield the event loop and wait for it to finish.
+        if not multi_model_manager._model_ready_event.is_set():
+            print(f"Text model '{requested_model}': waiting for model load to complete "
+                  f"(attempt {_attempt + 1}/{_MAX_WAIT_TRIES})…")
+            await asyncio.to_thread(
+                multi_model_manager._model_ready_event.wait, 30.0
+            )
+            await asyncio.sleep(0)
+
+        # In a thread: request_model may block waiting for a busy model to go
+        # idle before evicting it; blocking the event loop here would deadlock.
+        model_info = await asyncio.to_thread(
+            multi_model_manager.request_model,
+            requested_model,
+            "text",
+        )
+        if model_info.get('error'):
+            # CUDA-poison errors are unrecoverable → 503; others (unknown model) → 404.
+            _status = 503 if 'CUDA context corrupted' in str(model_info['error']) else 404
+            raise HTTPException(status_code=_status, detail=model_info['error'])
+
+        _model_key = model_info.get('model_key')
+        _candidate = None
+        _acq = multi_model_manager.acquire_model_instance(_model_key) if _model_key else None
+        if _acq:
+            _instance_idx, _candidate = _acq
+            # Guard against stale pool entries (model evicted but pool not cleared)
+            if hasattr(_candidate, 'backend') and _candidate.backend is None:
+                multi_model_manager.release_model_instance(_model_key, _instance_idx)
+                _instance_idx = None
+                _candidate = None
+        if _candidate is None:
+            _candidate = multi_model_manager.get_model_for_request(requested_model)
+        if _candidate is None and model_manager.backend is not None:
+            _candidate = model_manager
+        # Validate the candidate has a working backend before accepting it
+        if _candidate is not None:
+            if hasattr(_candidate, 'backend') and _candidate.backend is None:
+                _candidate = None
+        if _candidate is not None:
+            mm = _candidate
+            break
+
+        print(f"Text model '{requested_model}' not ready, retrying in 5s "
+              f"(attempt {_attempt + 1}/{_MAX_WAIT_TRIES})…")
+        await asyncio.sleep(5)

    def _release_instance():
        if _instance_idx is not None and _model_key:
@@ -309,12 +353,10 @@ async def chat_completions(request: ChatCompletionRequest, http_request: Request

    if mm is None:
        _release_instance()
-        if model_manager.backend is not None:
-            current_manager = model_manager
-        else:
-            raise HTTPException(status_code=503, detail="Model not loaded")
-    else:
-        current_manager = mm
+        raise HTTPException(status_code=503,
+                            detail=f"Model '{requested_model}' could not be loaded after waiting. "
+                                   "Another model may be using all available VRAM.")
+    current_manager = mm

    # Inject system prompt if --system-prompt flag was provided
    messages = request.messages
@@ -1161,6 +1203,7 @@ async def chat_completions(request: ChatCompletionRequest, http_request: Request
                    tool_parser,
                    request.response_format,
                    _prefix_key,
+                    enable_thinking=reasoning_enabled,
                ):
                    yield chunk
            finally:
@@ -1182,6 +1225,7 @@ async def chat_completions(request: ChatCompletionRequest, http_request: Request
                tool_parser,
                request.response_format,
                force_reasoning_args,
+                enable_thinking=reasoning_enabled,
            )
        finally:
            _release_instance()
@@ -1198,6 +1242,7 @@ async def stream_chat_response(
    tool_parser: ToolCallParser,
    response_format: Optional[Dict] = None,
    prefix_key: str = "",
+    enable_thinking: bool = False,
 ) -> AsyncGenerator[str, None]:
    """Stream chat completion response with queue notifications."""
    completion_id = f"chatcmpl-{uuid.uuid4().hex}"
@@ -1327,6 +1372,7 @@ async def stream_chat_response(
            stop=stop,
            tools=tools,
            response_format=response_format,
+            enable_thinking=enable_thinking,
        ):
            chunk_count += 1
            # Always filter malformed content (regex-based, works per-chunk)
@@ -1547,6 +1593,7 @@ async def generate_chat_response(
    tool_parser: ToolCallParser,
    response_format: Optional[Dict] = None,
    force_reasoning_args: Optional[List[str]] = None,
+    enable_thinking: bool = False,
 ) -> Dict:
    """Generate non-streaming chat completion response."""
    completion_id = f"chatcmpl-{uuid.uuid4().hex}"
@@ -1583,6 +1630,7 @@ async def generate_chat_response(
            stop=stop,
            tools=tools,
            response_format=response_format,
+            enable_thinking=enable_thinking,
        )
        
        # Always filter out malformed content
@@ -1748,9 +1796,12 @@ async def completions(request: CompletionRequest):
    requested_model = request.model
    
    # Use the manager to resolve the model and manage VRAM (handles ondemand unloading)
-    model_info = multi_model_manager.request_model(
+    # In a thread: request_model may block (thermal cooldown / waiting for a busy
+    # model) and we must not stall the event loop.
+    model_info = await asyncio.to_thread(
+        multi_model_manager.request_model,
        requested_model=requested_model,
-        model_type="text"
+        model_type="text",
    )
    
    # Check if the model was rejected as not allowed

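Note on the `asyncio.to_thread` changes above: a small self-contained sketch of why the blocking call is moved off the event loop. `blocking_request_model` is a hypothetical stand-in for `multi_model_manager.request_model`, and the timings are illustrative only. The heartbeat task keeps ticking while the worker thread waits, which would not happen if the wait ran directly on the loop.

```python
import asyncio
import threading
from typing import Optional


def blocking_request_model(name: str, ready: threading.Event) -> dict:
    """Hypothetical blocking loader: waits until the model is marked ready."""
    ready.wait(timeout=1.0)
    return {"model_key": name} if ready.is_set() else {}


async def resolve_with_retry(name: str, ready: threading.Event,
                             max_tries: int = 5, delay: float = 0.01) -> Optional[dict]:
    for _attempt in range(max_tries):
        # Run the blocking call in a worker thread so the loop stays responsive.
        info = await asyncio.to_thread(blocking_request_model, name, ready)
        if info.get("model_key"):
            return info
        await asyncio.sleep(delay)
    return None


async def demo():
    ready = threading.Event()
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while not ready.is_set():
            ticks += 1
            await asyncio.sleep(0.005)

    hb = asyncio.create_task(heartbeat())
    # Simulate another load finishing shortly after the request arrives.
    asyncio.get_running_loop().call_later(0.05, ready.set)
    info = await resolve_with_retry("demo-model", ready)
    await hb
    return info, ticks
```

Calling `asyncio.run(demo())` returns the resolved model info together with the number of heartbeat ticks observed while the loader was blocked.
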
--- a/codai/api/transcriptions.py
+++ b/codai/api/transcriptions.py
@@ -18,6 +18,7 @@
 Audio transcription endpoint for the codai API.
 """

+import asyncio
 import io
 import os
 import tempfile
@@ -143,7 +144,8 @@ async def create_transcription(
        else multi_model_manager.whisper_servers.get(model)
    )
    if whisper_server is not None:
-        multi_model_manager.request_model(requested_model=model, model_type="audio")
+        await asyncio.to_thread(
+            multi_model_manager.request_model, requested_model=model, model_type="audio")
        if not whisper_server.is_running():
            whisper_server.start(
                getattr(whisper_server, "_model_path", None),
@@ -166,7 +168,8 @@ async def create_transcription(
        return _format_response(response_format, result.get("text", ""), [])

    # Use the manager to resolve the model and manage VRAM
-    model_info = multi_model_manager.request_model(
+    model_info = await asyncio.to_thread(
+        multi_model_manager.request_model,
        requested_model=model,
        model_type="audio"
    )

--- a/codai/api/tts.py
+++ b/codai/api/tts.py
@@ -99,9 +99,10 @@ async def create_speech(request: TTSRequest, http_request: Request = None):
        return {"audio": audio_base64}

    # Use the manager to resolve the model and manage VRAM
-    model_info = multi_model_manager.request_model(
+    model_info = await asyncio.to_thread(
+        multi_model_manager.request_model,
        requested_model=request.model,
-        model_type="tts"
+        model_type="tts",
    )
    
    # Check if the model was rejected as not allowed

--- a/codai/backends/cuda.py
+++ b/codai/backends/cuda.py
@@ -44,6 +44,31 @@ except (ImportError, AttributeError):
    _grammar_guided_gen = False


+def _make_thermal_criteria():
+    """A StoppingCriteria that pauses generation while the CPU/GPU is too hot.
+
+    It runs ON the generation thread (between token forward passes), so blocking
+    here actually pauses GPU work — unlike the streamer consumer loop, which is
+    decoupled. Returns False so it never ends generation; throttled so it doesn't
+    read sensors on every token. Returns None if transformers is unavailable.
+    """
+    try:
+        from transformers import StoppingCriteria
+    except Exception:
+        return None
+
+    class _ThermalPause(StoppingCriteria):
+        def __call__(self, input_ids, scores, **kwargs):
+            try:
+                from codai.models.thermal import checkpoint
+                checkpoint(context="text-gen", throttle_seconds=2.0)
+            except Exception:
+                pass
+            return False
+
+    return _ThermalPause()
+
+
 class NvidiaBackend(ModelBackend):
    """Backend for NVIDIA GPUs using HuggingFace Transformers."""
    
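Note on `_make_thermal_criteria` above: the hunk passes `throttle_seconds=2.0` to `codai.models.thermal.checkpoint`, whose body is not part of this diff. The sketch below is one plausible way such throttling could work, not the project's implementation. `ThrottledCheck` and its `probe` callback are hypothetical names; like the criteria in the diff, the callable always returns `False` so it never ends generation.

```python
import time
from typing import Callable, Optional


class ThrottledCheck:
    """Run `probe` at most once per `throttle_seconds`; never request a stop."""

    def __init__(self, probe: Callable[[], None], throttle_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._probe = probe
        self._throttle = throttle_seconds
        self._clock = clock
        self._last: Optional[float] = None

    def __call__(self) -> bool:
        now = self._clock()
        # Skip the (potentially slow) sensor read if we checked recently.
        if self._last is not None and now - self._last < self._throttle:
            return False
        self._last = now
        try:
            self._probe()  # may block while the hardware cools down
        except Exception:
            pass  # a failing sensor read must not abort generation
        return False
```

Injecting the clock keeps the throttle logic testable without sleeping.
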
@@ -201,6 +226,36 @@ class NvidiaBackend(ModelBackend):
                        raise e
            raise
    
+    def _make_bnb_config(self, model_name: str, load_in_4bit: bool, load_in_8bit: bool):
+        """Build a transformers BitsAndBytesConfig (the modern quant API).
+
+        Passing load_in_4bit/load_in_8bit as direct from_pretrained kwargs is
+        removed in recent transformers and raises TypeError — which previously
+        forced a silent fallback to FULL-PRECISION loading (the model then no
+        longer fit on the GPU, offloaded to CPU, and leaked VRAM on eviction).
+        Always go through quantization_config instead.
+        """
+        ml = model_name.lower()
+        if 'qwen3.5' in ml and ('a3b' in ml or 'moe' in ml):
+            print(f"Warning: {model_name} does not support bitsandbytes quantization")
+            return None
+        try:
+            import bitsandbytes as bnb  # noqa: F401
+            import torch
+            from transformers import BitsAndBytesConfig
+        except ImportError:
+            print("Warning: bitsandbytes not installed. Quantization disabled.")
+            return None
+        print(f"Using {4 if load_in_4bit else 8}-bit quantization")
+        if load_in_4bit:
+            return BitsAndBytesConfig(
+                load_in_4bit=True,
+                bnb_4bit_quant_type='nf4',
+                bnb_4bit_compute_dtype=torch.float16,
+                bnb_4bit_use_double_quant=True,
+            )
+        return BitsAndBytesConfig(load_in_8bit=True)
+
    def _is_moe_model(self, model_name: str) -> bool:
        """Check if model is a MoE model."""
        moe_indicators = ['moe', 'mixtral', 'qwen3_5_moe', 'qwen3.5_moe', 'expert', 'a3b']
@@ -317,7 +372,8 @@ class NvidiaBackend(ModelBackend):
        flash_attn = kwargs.get('flash_attn', False)
        offload_strategy = kwargs.get('offload_strategy', 'auto')
        max_gpu_percent = kwargs.get('max_gpu_percent', None)
-        
+        expected_vram_gb = kwargs.get('expected_vram_gb') or 0
+
        # Check for --no-ram mode
        no_ram = kwargs.get('no_ram', False)
        if not no_ram:
@@ -328,12 +384,37 @@ class NvidiaBackend(ModelBackend):
                    no_ram = True
            except Exception:
                pass
-        
+
        self._pending_ram_gb = manual_ram_gb
-        
+
        print(f"Loading HuggingFace model: {model_name}")
-        
-        self.use_flash_attn = flash_attn
+
+        # Flash-Attention-2 requires the ENTIRE model resident on a single CUDA
+        # device.  If the model will be split across GPU+CPU (offloading), FA2
+        # triggers a device-side assert that corrupts the whole CUDA context.
+        # So FA2 is only safe when the model fits fully in free GPU VRAM, or the
+        # user forced full-GPU residence (no_ram / offload_strategy='none').
+        self._fa2_safe = True
+        if flash_attn:
+            _full_gpu_forced = no_ram or offload_strategy == 'none'
+            if not _full_gpu_forced:
+                try:
+                    import torch as _t
+                    if _t.cuda.is_available() and expected_vram_gb > 0:
+                        _free, _ = _t.cuda.mem_get_info(0)
+                        _free_gb = _free / 1e9
+                        # expected_vram_gb already includes ~15% overhead; the
+                        # model must fit entirely on GPU for FA2 to be safe.
+                        if expected_vram_gb > _free_gb:
+                            self._fa2_safe = False
+                            print(f"  Flash Attention 2 disabled: model needs "
+                                  f"~{expected_vram_gb:.1f} GB but only {_free_gb:.1f} GB "
+                                  f"GPU free → will offload to CPU (FA2 needs full-GPU "
+                                  f"residence). Using SDPA instead.")
+                except Exception:
+                    pass
+
+        self.use_flash_attn = flash_attn and self._fa2_safe
        self.check_flash_attn_support()
        
        self.device = self._detect_device()
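Note on the FA2 safety check above: the decision can be read as a pure function of the inputs the hunk consults. This is a simplified sketch; `choose_attn_implementation` is a hypothetical name, and `fa2_installed` stands in for whatever `check_flash_attn_support()` determines. As in the diff, an unknown size (`expected_vram_gb == 0`) leaves FA2 enabled.

```python
def choose_attn_implementation(flash_attn_requested: bool,
                               fa2_installed: bool,
                               expected_vram_gb: float,
                               free_vram_gb: float,
                               full_gpu_forced: bool = False) -> str:
    """Pick "flash_attention_2" only when the model should stay fully on GPU."""
    if not (flash_attn_requested and fa2_installed):
        return "sdpa"
    # no_ram / offload_strategy == 'none': the user forces full-GPU residence.
    if full_gpu_forced:
        return "flash_attention_2"
    # A known size that exceeds free VRAM implies CPU offload, so avoid FA2.
    if expected_vram_gb > 0 and expected_vram_gb > free_vram_gb:
        return "sdpa"
    return "flash_attention_2"
```

For example, a model expected to need 20 GB with 12 GB free resolves to `"sdpa"` unless full-GPU residence is forced.
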
@@ -368,16 +449,9 @@ class NvidiaBackend(ModelBackend):
            
            # Still allow quantization in no-ram mode (reduces VRAM usage)
            if load_in_4bit or load_in_8bit:
-                if 'qwen3.5' in model_name.lower() and ('a3b' in model_name.lower() or 'moe' in model_name.lower()):
-                    print(f"  Warning: {model_name} does not support bitsandbytes quantization")
-                else:
-                    try:
-                        import bitsandbytes as bnb
-                        print(f"  Using {4 if load_in_4bit else 8}-bit quantization")
-                        load_kwargs['load_in_4bit'] = load_in_4bit
-                        load_kwargs['load_in_8bit'] = load_in_8bit
-                    except ImportError:
-                        print("  Warning: bitsandbytes not installed. Quantization disabled.")
+                _qc = self._make_bnb_config(model_name, load_in_4bit, load_in_8bit)
+                if _qc is not None:
+                    load_kwargs['quantization_config'] = _qc
            
            try:
                model = AutoModelForCausalLM.from_pretrained(model_name, **load_kwargs)
@@ -404,17 +478,10 @@ class NvidiaBackend(ModelBackend):
        load_kwargs = {'trust_remote_code': True}
        
        if load_in_4bit or load_in_8bit:
-            if 'qwen3.5' in model_name.lower() and ('a3b' in model_name.lower() or 'moe' in model_name.lower()):
-                print(f"Warning: {model_name} does not support bitsandbytes quantization")
-            else:
-                try:
-                    import bitsandbytes as bnb
-                    print(f"Using {4 if load_in_4bit else 8}-bit quantization")
-                    load_kwargs['load_in_4bit'] = load_in_4bit
-                    load_kwargs['load_in_8bit'] = load_in_8bit
-                except ImportError:
-                    print("Warning: bitsandbytes not installed. Quantization disabled.")
-        
+            _qc = self._make_bnb_config(model_name, load_in_4bit, load_in_8bit)
+            if _qc is not None:
+                load_kwargs['quantization_config'] = _qc
+
        if self.device == "cuda":
            load_kwargs['dtype'] = torch.float16
        else:
@@ -427,7 +494,12 @@ class NvidiaBackend(ModelBackend):
        if self.use_flash_attn and self.flash_attn_available:
            load_kwargs['attn_implementation'] = "flash_attention_2"
            print("Using Flash Attention 2")
-        
+        else:
+            # SDPA safely handles GPU+CPU split models and still uses flash
+            # kernels for the GPU-resident layers — the safe default when the
+            # model is offloaded (FA2 would device-side-assert here).
+            load_kwargs['attn_implementation'] = "sdpa"
+
        model = None
        vram_percentages = self._get_vram_percentages_for_gpu(model_name, offload_strategy, max_gpu_percent)
        
@@ -450,40 +522,86 @@ class NvidiaBackend(ModelBackend):
                )
        else:
            first_vram_pct = vram_percentages[0] if vram_percentages else 0.93
-            
+
            for vram_pct in vram_percentages:
                if self.device != "cuda":
-                    load_kwargs['device_map'] = None
-                    print("Loading model in CPU-only mode...")
-                    model = self._try_load_model(model_name, load_kwargs, self.device)
-                    if model is not None:
-                        break
-                
+                    # No CUDA device — go straight to CPU+disk loading below.
+                    break
+
                if vram_pct > 0:
+                    # Build max_memory: GPU budget capped at actual FREE VRAM so
+                    # we never try to allocate more than what's physically available.
+                    # Excess layers overflow to CPU RAM automatically via device_map.
                    max_memory = self._get_gpu_memory_map_with_limit(vram_pct)
                    load_kwargs['max_memory'] = max_memory
                    load_kwargs['device_map'] = 'auto'
-                    print(f"\nTrying with GPU limit: {vram_pct*100:.0f}% VRAM")
-                    
+                    _gpu_gb = max_memory.get(0, 0) / 1e9
+                    _cpu_gb = max_memory.get('cpu', 0) / 1e9
+                    print(f"\nTrying GPU {_gpu_gb:.1f} GB + CPU {_cpu_gb:.1f} GB"
+                          f" (device_map=auto, {vram_pct*100:.0f}% VRAM cap)")
+
                    model = self._try_load_model(model_name, load_kwargs, self.device)
-                    
+
                    if model is not None:
-                        print(f"  ✓ Model loaded successfully with {vram_pct*100:.0f}% GPU VRAM limit")
+                        print(f"  ✓ Model loaded — GPU {_gpu_gb:.1f} GB / CPU {_cpu_gb:.1f} GB")
                        if vram_pct < first_vram_pct:
-                            print(f"  (Reduced from {first_vram_pct*100:.0f}% due to memory constraints)")
+                            print(f"  (Reduced GPU cap from {first_vram_pct*100:.0f}%"
+                                  f" due to memory constraints)")
                        break
                    else:
-                        print(f"  ✗ Out of memory with {vram_pct*100:.0f}% GPU VRAM, trying lower limit...")
+                        print(f"  ✗ OOM at GPU {_gpu_gb:.1f} GB, trying lower GPU cap…")
                        if torch.cuda.is_available():
                            torch.cuda.empty_cache()
                else:
-                    print("\nFalling back to CPU-only mode...")
-                    load_kwargs['max_memory'] = {0: 0, 'cpu': int((manual_ram_gb or 48) * 1e9)}
-                    load_kwargs['device_map'] = 'auto'
-                    model = self._try_load_model(model_name, load_kwargs, "cpu")
+                    # vram_pct == 0: GPU (all free VRAM) + CPU RAM + disk overflow.
+                    # Use every byte of GPU that's free, then spill to CPU RAM, then
+                    # disk — NEVER leave GPU idle when loading this fallback level.
+                    import psutil as _psutil
+                    _free_vram = 0
+                    if torch.cuda.is_available():
+                        try:
+                            _free_vram, _ = torch.cuda.mem_get_info(0)
+                        except Exception:
+                            pass
+                    _headroom = 512 * 1024 * 1024
+                    _gpu_budget = max(0, _free_vram - _headroom)
+                    _free_ram = _psutil.virtual_memory().available
+                    _cpu_budget = max(int(2e9), int(_free_ram * 0.80))
+                    _disk_dir = offload_dir or os.path.join(
+                        os.path.expanduser('~'), '.cache', 'coderai', 'offload')
+                    os.makedirs(_disk_dir, exist_ok=True)
+                    print(f"\nGPU {_gpu_budget/1e9:.1f} GB + CPU {_cpu_budget/1e9:.1f} GB"
+                          f" + disk ({_disk_dir})")
+                    _spill_kwargs = {
+                        **load_kwargs,
+                        'device_map': 'auto',
+                        'max_memory': {0: _gpu_budget, 'cpu': _cpu_budget},
+                        'offload_folder': _disk_dir,
+                        'offload_buffers': True,
+                    }
+                    model = self._try_load_model(model_name, _spill_kwargs, self.device)
                    if model is not None:
-                        print("  ✓ Model loaded successfully on CPU")
+                        print(f"  ✓ Model loaded — GPU {_gpu_budget/1e9:.1f} GB"
+                              f" / CPU {_cpu_budget/1e9:.1f} GB / disk overflow")
                        break
+
+            # Absolute last resort: pure CPU without device_map.
+            # Only reached when CUDA is unavailable or all GPU+RAM+disk paths failed.
+            # Uses device_map=None to avoid accelerate hooks that assume CUDA.
+            if model is None:
+                print("\nFalling back to pure CPU (no usable GPU path)…")
+                cpu_kwargs = {
+                    'trust_remote_code': True,
+                    'torch_dtype': torch.float32,
+                    'low_cpu_mem_usage': True,
+                }
+                if offload_dir:
+                    cpu_kwargs['offload_folder'] = offload_dir
+                # FA2 kernels are CUDA-only and need fp16/bf16, so the pure-CPU
+                # float32 path uses SDPA instead.
+                cpu_kwargs['attn_implementation'] = "sdpa"
+                model = self._try_load_model(model_name, cpu_kwargs, "cpu")
+                if model is not None:
+                    print("  ✓ Model loaded on CPU (no GPU)")
        
        if model is None:
            raise RuntimeError("Failed to load model: Out of memory even with minimum GPU usage")
@@ -499,17 +617,29 @@ class NvidiaBackend(ModelBackend):
        print(f"Model capabilities: {caps}")
    
    def _get_gpu_memory_map_with_limit(self, vram_fraction: float) -> Dict:
-        """Get max_memory dict with specified VRAM fraction limit."""
+        """Get max_memory dict for device_map='auto'.
+
+        GPU budget = min(total × fraction, free − 512 MB headroom).
+        Capping at free VRAM ensures we never ask accelerate to allocate more
+        than what's physically available; layers that exceed the GPU budget
+        spill to CPU RAM automatically via device_map.
+        """
        import torch
        max_memory = {}
-        
+
        if torch.cuda.is_available():
            for i in range(torch.cuda.device_count()):
                props = torch.cuda.get_device_properties(i)
                total_vram = props.total_memory
-                usable_vram = int(total_vram * vram_fraction)
-                max_memory[i] = usable_vram
-        
+                try:
+                    free_vram, _ = torch.cuda.mem_get_info(i)
+                except Exception:
+                    free_vram = total_vram
+                headroom = 512 * 1024 * 1024  # 512 MB for CUDA driver overhead
+                limit_by_fraction = int(total_vram * vram_fraction)
+                limit_by_free     = max(0, free_vram - headroom)
+                max_memory[i] = min(limit_by_fraction, limit_by_free)
+
        manual_ram_gb = getattr(self, '_pending_ram_gb', None)
        if manual_ram_gb:
            max_memory['cpu'] = int(manual_ram_gb * 1e9)
@@ -518,7 +648,7 @@ class NvidiaBackend(ModelBackend):
            available_ram = psutil.virtual_memory().available
            usable_ram = max(0, available_ram - int(4e9))
            max_memory['cpu'] = usable_ram
-        
+
        return max_memory
    
    def format_messages(self, messages: List[ChatMessage]) -> str:
@@ -835,19 +965,24 @@ class NvidiaBackend(ModelBackend):
        if repeat_penalty != 1.0:
            generation_kwargs["repetition_penalty"] = repeat_penalty
        
+        # Mid-generation thermal checkpoint (runs on the generate thread).
+        _criteria = []
+        _therm = _make_thermal_criteria()
+        if _therm is not None:
+            _criteria.append(_therm)
        if stop:
            class StopOnSequence(StoppingCriteria):
                def __init__(self, stop_sequences, tokenizer):
                    self.stop_sequences = stop_sequences
                    self.tokenizer = tokenizer
-                
+
                def __call__(self, input_ids, scores, **kwargs):
                    decoded = self.tokenizer.decode(input_ids[0][-20:], skip_special_tokens=True)
                    return any(seq in decoded for seq in self.stop_sequences)
-            
-            generation_kwargs["stopping_criteria"] = StoppingCriteriaList([
-                StopOnSequence(stop, self.tokenizer)
-            ])
+
+            _criteria.append(StopOnSequence(stop, self.tokenizer))
+        if _criteria:
+            generation_kwargs["stopping_criteria"] = StoppingCriteriaList(_criteria)
        
        generation_error = None
        
@@ -890,9 +1025,19 @@ class NvidiaBackend(ModelBackend):
            _time.time() - self._kv_timestamp < self._kv_ttl
        )

+    def _model_on_cuda(self) -> bool:
+        """Return True only when the model's first parameter is actually on a CUDA device."""
+        try:
+            return next(self.model.parameters()).is_cuda
+        except StopIteration:
+            return False
+
    def _build_kv_prefix(self, prefix_text: str):
        """Forward-pass on prefix_text to populate the KV state."""
        import torch
+        # KV prefix caching requires CUDA tensors; refuse to build it for CPU-mode models.
+        if not self._model_on_cuda():
+            raise RuntimeError("KV prefix cache requires CUDA; model is on CPU")
        inputs = self.tokenizer(
            prefix_text, return_tensors="pt", add_special_tokens=False
        )
@@ -910,6 +1055,8 @@ class NvidiaBackend(ModelBackend):
    def invalidate_kv_cache(self) -> None:
        """Discard the cached KV state (call on model unload/swap)."""
        self._kv_prefix_text = None
+        if self._kv_past_key_values is not None:
+            del self._kv_past_key_values
        self._kv_past_key_values = None
        self._kv_prefix_len = 0
        self._kv_timestamp = 0.0
@@ -934,8 +1081,67 @@ class NvidiaBackend(ModelBackend):
        ]
        return self.format_messages(chat_msgs)

+    def _eos_token_ids(self):
+        """All token ids that should END generation — including the chat turn
+        boundary.  Qwen's turn ends with <|im_end|>, but tokenizer.eos_token_id is
+        <|endoftext|>; without im_end the model never stops and hallucinates extra
+        'assistant'/'user' turns.  Returns a list (HF generate accepts a list)."""
+        ids = set()
+        try:
+            if self.tokenizer.eos_token_id is not None:
+                ids.add(int(self.tokenizer.eos_token_id))
+        except Exception:
+            pass
+        for tok in ('<|im_end|>', '<|eot_id|>', '<|end|>', '<|endoftext|>',
+                    '<|end_of_text|>', '<end_of_turn>'):
+            try:
+                tid = self.tokenizer.convert_tokens_to_ids(tok)
+                if isinstance(tid, int) and tid >= 0 and tid != getattr(
+                        self.tokenizer, 'unk_token_id', None):
+                    ids.add(tid)
+            except Exception:
+                pass
+        return list(ids) if ids else self.tokenizer.eos_token_id
+
+    def _build_chat_prompt(self, messages, enable_thinking: bool = False,
+                           add_generation_prompt: bool = True) -> str:
+        """Build the prompt string using the MODEL's own chat template when it has
+        one (correct special tokens + proper `enable_thinking` handling for Qwen3).
+        Falls back to the legacy custom formatter when no template is available.
+
+        `enable_thinking=True` keeps reasoning <think> blocks available for callers
+        that ask for them; `False` (default) suppresses them via the template.
+        """
+        tmpl = getattr(self.tokenizer, 'chat_template', None)
+        if tmpl:
+            # Normalise to plain {role, content} dicts for apply_chat_template.
+            norm = []
+            for m in messages:
+                if isinstance(m, dict):
+                    norm.append({'role': m.get('role'), 'content': m.get('content') or ''})
+                else:
+                    norm.append({'role': getattr(m, 'role', None),
+                                 'content': getattr(m, 'content', '') or ''})
+            try:
+                return self.tokenizer.apply_chat_template(
+                    norm, tokenize=False,
+                    add_generation_prompt=add_generation_prompt,
+                    enable_thinking=enable_thinking)
+            except TypeError:
+                # Tokenizer's template doesn't accept enable_thinking — use plain.
+                try:
+                    return self.tokenizer.apply_chat_template(
+                        norm, tokenize=False,
+                        add_generation_prompt=add_generation_prompt)
+                except Exception:
+                    pass
+            except Exception:
+                pass
+        return self._format_messages_to_str(messages)
+
    def generate_chat(self, messages, max_tokens=None, temperature=0.7,
-                      top_p=1.0, stop=None, tools=None, response_format=None) -> str:
+                      top_p=1.0, stop=None, tools=None, response_format=None,
+                      enable_thinking=False) -> str:
        """
        Non-streaming chat generation with KV prefix caching.

@@ -947,7 +1153,8 @@ class NvidiaBackend(ModelBackend):
        if max_tokens is None:
            max_tokens = 512

-        full_prompt = self._format_messages_to_str(messages)
+        full_prompt = self._build_chat_prompt(messages, enable_thinking=enable_thinking,
+                                              add_generation_prompt=True)
        total_input_ids = self.tokenizer(full_prompt, return_tensors="pt")['input_ids']
        total_prompt_len = int(total_input_ids.shape[1])

@@ -961,8 +1168,9 @@ class NvidiaBackend(ModelBackend):
        past_kv = None
        cached_len = 0

-        if prefix_msgs:
-            prefix_text = self._format_messages_to_str(prefix_msgs)
+        if prefix_msgs and self._model_on_cuda():
+            prefix_text = self._build_chat_prompt(
+                prefix_msgs, enable_thinking=enable_thinking, add_generation_prompt=False)
            if self._kv_cache_valid() and self._kv_prefix_text == prefix_text:
                past_kv = self._kv_past_key_values
                cached_len = self._kv_prefix_len
@@ -981,9 +1189,15 @@ class NvidiaBackend(ModelBackend):
            top_p=top_p if do_sample else None,
            do_sample=do_sample,
            pad_token_id=self.tokenizer.pad_token_id,
-            eos_token_id=self.tokenizer.eos_token_id,
+            eos_token_id=self._eos_token_ids(),
            use_cache=True,
        )
+        # Mid-generation thermal checkpoint (runs on the generate thread, so it
+        # pauses GPU work between tokens when the CPU/GPU is too hot).
+        _therm = _make_thermal_criteria()
+        if _therm is not None:
+            from transformers import StoppingCriteriaList
+            gen_kwargs["stopping_criteria"] = StoppingCriteriaList([_therm])

        generated_text = ""
        try:
@@ -1014,7 +1228,12 @@ class NvidiaBackend(ModelBackend):
            generated_text = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
        except Exception as e:
            print(f"Warning: KV-cached generate_chat failed ({e}), retrying without cache")
+            self.invalidate_kv_cache()
            cached_len = 0
+            # Determine if the error is a CUDA device-placement issue; if so, also
+            # disable the internal KV cache which accumulates mixed-device tensors.
+            _is_device_error = "is_cuda" in str(e) or "device" in str(e).lower()
+            _fallback_kwargs = {**gen_kwargs, 'use_cache': not _is_device_error}
            try:
                total_input_ids = self.tokenizer(
                    full_prompt, return_tensors="pt"
@@ -1024,13 +1243,34 @@ class NvidiaBackend(ModelBackend):
                    outputs = self.model.generate(
                        input_ids=total_input_ids,
                        attention_mask=attn_mask,
-                        **gen_kwargs,
+                        **_fallback_kwargs,
                    )
                new_tokens = outputs[0][total_prompt_len:]
                generated_text = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
            except Exception as e2:
                print(f"Error: generate_chat fallback failed: {e2}")
-                generated_text = ""
+                # Last resort: disable internal KV cache entirely
+                if _fallback_kwargs.get('use_cache', True):
+                    try:
+                        no_cache_kwargs = {**gen_kwargs, 'use_cache': False}
+                        total_input_ids = self.tokenizer(
+                            full_prompt, return_tensors="pt"
+                        )['input_ids'].to(self.model.device)
+                        attn_mask = torch.ones_like(total_input_ids)
+                        with torch.no_grad():
+                            outputs = self.model.generate(
+                                input_ids=total_input_ids,
+                                attention_mask=attn_mask,
+                                **no_cache_kwargs,
+                            )
+                        new_tokens = outputs[0][total_prompt_len:]
+                        generated_text = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
+                        print("generate_chat: recovered with use_cache=False")
+                    except Exception as e3:
+                        print(f"Error: generate_chat no-cache fallback failed: {e3}")
+                        generated_text = ""
+                else:
+                    generated_text = ""

        try:
            comp_len = len(self.tokenizer.encode(generated_text)) if generated_text else 0
@@ -1046,7 +1286,8 @@ class NvidiaBackend(ModelBackend):

    async def generate_chat_stream(self, messages, max_tokens=None,
                                   temperature=0.7, top_p=1.0, stop=None,
-                                   tools=None, response_format=None):
+                                   tools=None, response_format=None,
+                                   enable_thinking=False):
        """
        Streaming chat generation with KV prefix caching.
        Uses the same prefix-cache strategy as generate_chat.
@@ -1058,7 +1299,8 @@ class NvidiaBackend(ModelBackend):
        if max_tokens is None:
            max_tokens = 512

-        full_prompt = self._format_messages_to_str(messages)
+        full_prompt = self._build_chat_prompt(messages, enable_thinking=enable_thinking,
+                                              add_generation_prompt=True)
        total_input_ids = self.tokenizer(full_prompt, return_tensors="pt")['input_ids']
        total_prompt_len = int(total_input_ids.shape[1])

@@ -1070,8 +1312,9 @@ class NvidiaBackend(ModelBackend):
        past_kv = None
        cached_len = 0

-        if prefix_msgs:
-            prefix_text = self._format_messages_to_str(prefix_msgs)
+        if prefix_msgs and self._model_on_cuda():
+            prefix_text = self._build_chat_prompt(
+                prefix_msgs, enable_thinking=enable_thinking, add_generation_prompt=False)
            if self._kv_cache_valid() and self._kv_prefix_text == prefix_text:
                past_kv = self._kv_past_key_values
                cached_len = self._kv_prefix_len
@@ -1109,11 +1352,16 @@ class NvidiaBackend(ModelBackend):
            do_sample=do_sample,
            streamer=streamer,
            pad_token_id=self.tokenizer.pad_token_id,
-            eos_token_id=self.tokenizer.eos_token_id,
+            eos_token_id=self._eos_token_ids(),
            use_cache=True,
            **extra_gen,
        )

+        # Mid-generation thermal checkpoint (runs on the generate thread).
+        _criteria = []
+        _therm = _make_thermal_criteria()
+        if _therm is not None:
+            _criteria.append(_therm)
        if stop:
            class _StopOnSeq(StoppingCriteria):
                def __init__(self, seqs, tok):
@@ -1122,9 +1370,9 @@ class NvidiaBackend(ModelBackend):
                def __call__(self, input_ids, scores, **kw):
                    decoded = self.tok.decode(input_ids[0][-20:], skip_special_tokens=True)
                    return any(s in decoded for s in self.seqs)
-            gen_kwargs['stopping_criteria'] = StoppingCriteriaList(
-                [_StopOnSeq(stop, self.tokenizer)]
-            )
+            _criteria.append(_StopOnSeq(stop, self.tokenizer))
+        if _criteria:
+            gen_kwargs['stopping_criteria'] = StoppingCriteriaList(_criteria)

        gen_error = [None]
        comp_tokens = [0]
@@ -1155,21 +1403,206 @@ class NvidiaBackend(ModelBackend):

        if gen_error[0]:
            print(f"Warning: KV-cached stream generation error: {gen_error[0]}")
+            self.invalidate_kv_cache()

    def get_model_name(self) -> str:
        return self.model_name or "unknown"
    
    def cleanup(self) -> None:
-        import torch
+        import torch, gc
+        try:
+            from codai.api.state import get_global_debug
+            _dbg = bool(get_global_debug())
+        except Exception:
+            _dbg = False
+
+        def _vram_gb():
+            try:
+                if torch.cuda.is_available():
+                    free, total = torch.cuda.mem_get_info()
+                    return (total - free) / 1e9
+            except Exception:
+                pass
+            return -1.0
+
+        def _cuda_param_gb():
+            tot = 0
+            try:
+                for p in self.model.parameters():
+                    if p.data.is_cuda:
+                        tot += p.data.numel() * p.data.element_size()
+                for b in self.model.buffers():
+                    if b.data.is_cuda:
+                        tot += b.data.numel() * b.data.element_size()
+            except Exception:
+                pass
+            return tot / 1e9
+
+        _v0 = _vram_gb()
        self.invalidate_kv_cache()
        if self.model is not None:
+            _pg0 = _cuda_param_gb() if _dbg else 0.0
+            # Record the GPU storage pointers of THIS model's tensors so we can,
+            # after moving them to CPU, break any lingering external references
+            # (e.g. accelerate's tied_params_map, which keeps tied embedding /
+            # lm_head weights alive on the GPU and fragments the allocator so
+            # empty_cache() can't release the surrounding memory).  Scoped by
+            # data_ptr so we never touch a different (coexisting) model.
+            _orig_cuda_ptrs = set()
+            try:
+                for _p in self.model.parameters():
+                    if _p.data.is_cuda:
+                        _orig_cuda_ptrs.add(_p.data.untyped_storage().data_ptr())
+                for _b in self.model.buffers():
+                    if _b.data.is_cuda:
+                        _orig_cuda_ptrs.add(_b.data.untyped_storage().data_ptr())
+            except Exception:
+                pass
+
+            # Strip accelerate dispatch hooks AND their offload bookkeeping, which
+            # hold references to the original CUDA tensors.  Must happen before we
+            # move tensors, or the hooks keep the GPU copies alive.
+            try:
+                from accelerate.hooks import remove_hook_from_submodules
+                remove_hook_from_submodules(self.model)
+            except Exception:
+                pass
+            # Walk every submodule and move its raw _parameters/_buffers storage to
+            # CPU directly.  This reaches tensors that model.parameters() may skip
+            # (e.g. when wrapped by accelerate) and does NOT rely on model.to('cpu'),
+            # which is a silent no-op on dispatched models.
+            try:
+                import torch as _t
+                for _mod in self.model.modules():
+                    for _d in (_mod._parameters, _mod._buffers):
+                        for _name, _t_obj in list(_d.items()):
+                            if _t_obj is None:
+                                continue
+                            try:
+                                if getattr(_t_obj, 'is_cuda', False):
+                                    _d[_name] = _t_obj.to('cpu')
+                                # accelerate stores params as nn.Parameter; keep type
+                                elif hasattr(_t_obj, 'data') and getattr(_t_obj.data, 'is_cuda', False):
+                                    _t_obj.data = _t_obj.data.to('cpu')
+                            except Exception:
+                                pass
+                    # Drop per-module accelerate hook state that pins CUDA tensors.
+                    for _attr in ('_hf_hook', '_old_forward'):
+                        if hasattr(_mod, _attr):
+                            try:
+                                delattr(_mod, _attr)
+                            except Exception:
+                                pass
+            except Exception as e:
+                print(f"  cleanup: module-walk move issue: {e}")
+            for _attr in ('hf_device_map', '_hf_hook'):
+                try:
+                    if hasattr(self.model, _attr):
+                        delattr(self.model, _attr)
+                except Exception:
+                    pass
+            if _dbg:
+                print(f"  cleanup: CUDA param bytes {_pg0:.1f} → {_cuda_param_gb():.1f} GB")
            del self.model
-            del self.tokenizer
            self.model = None
+
+            # Break lingering references to THIS model's original GPU tensors that
+            # outlive the model (accelerate tied_params_map lists, stray caches).
+            # Only tensors whose storage pointer we recorded above are touched, so
+            # other models loaded alongside are never affected.
+            if _orig_cuda_ptrs:
+                try:
+                    broken = 0
+                    for obj in gc.get_objects():
+                        if not (isinstance(obj, torch.Tensor) and obj.is_cuda):
+                            continue
+                        try:
+                            if obj.untyped_storage().data_ptr() not in _orig_cuda_ptrs:
+                                continue
+                        except Exception:
+                            continue
+                        # Null this tensor out of any list/dict that still holds it.
+                        for ref in gc.get_referrers(obj):
+                            try:
+                                if isinstance(ref, list):
+                                    for i, it in enumerate(ref):
+                                        if it is obj:
+                                            ref[i] = None
+                                            broken += 1
+                                elif isinstance(ref, dict):
+                                    for k, v in list(ref.items()):
+                                        if v is obj:
+                                            ref[k] = None
+                                            broken += 1
+                            except Exception:
+                                pass
+                    if _dbg and broken:
+                        print(f"  cleanup: broke {broken} external GPU-tensor reference(s)")
+                except Exception:
+                    pass
+        if self.tokenizer is not None:
+            del self.tokenizer
            self.tokenizer = None
-            if torch.cuda.is_available():
-                torch.cuda.empty_cache()
-    
+        # Force Python GC before emptying the CUDA allocator pool so that all
+        # Python-held tensor references (closures, local vars, etc.) are dropped.
+        for _ in range(3):
+            gc.collect()
+        if torch.cuda.is_available():
+            torch.cuda.synchronize()
+            torch.cuda.empty_cache()
+            torch.cuda.synchronize()
+        # Release the model's host-side memory back to the OS (and any swap it
+        # was paged into) so RSS doesn't creep up across model swaps.
+        try:
+            from codai.models.manager import _trim_cpu_ram
+            _trim_cpu_ram()
+        except Exception:
+            pass
+        _v1 = _vram_gb()
+        if _v0 >= 0 and _v1 >= 0:
+            print(f"  cleanup: freed {_v0 - _v1:.1f} GB VRAM (now {_v1:.1f} GB used)")
+            if _dbg:
+                try:
+                    _alloc = torch.cuda.memory_allocated() / 1e9
+                    _resv = torch.cuda.memory_reserved() / 1e9
+                    print(f"  cleanup: torch allocated={_alloc:.1f} GB "
+                          f"reserved={_resv:.1f} GB (driver used={_v1:.1f} GB)")
+                except Exception:
+                    pass
+                # If a large chunk is still resident, name what's holding CUDA tensors.
+                if (_v0 - _v1) < 1.0 and _v1 > 2.0:
+                    try:
+                        biggest = []
+                        total = 0.0
+                        seen = set()
+                        for obj in gc.get_objects():
+                            try:
+                                if isinstance(obj, torch.Tensor) and obj.is_cuda:
+                                    if id(obj) in seen:
+                                        continue
+                                    seen.add(id(obj))
+                                    gb = obj.numel() * obj.element_size() / 1e9
+                                    total += gb
+                                    if gb > 0.05:
+                                        rtypes = []
+                                        for r in gc.get_referrers(obj)[:4]:
+                                            rt = type(r).__name__
+                                            if rt == 'dict':
+                                                try:
+                                                    rt = f"dict{list(r.keys())[:3]}"
+                                                except Exception:
+                                                    pass
+                                            rtypes.append(rt)
+                                        biggest.append((gb, tuple(obj.shape), rtypes))
+                            except Exception:
+                                continue
+                        biggest.sort(reverse=True)
+                        print(f"  cleanup-leak: {total:.1f} GB still in CUDA tensors; top holders:")
+                        for gb, shape, rtypes in biggest[:6]:
+                            print(f"    {gb:.2f} GB shape={shape} referrers={rtypes}")
+                    except Exception as e:
+                        print(f"  cleanup-leak scan failed: {e}")
+
    def get_context_size(self) -> int:
        """Return the model's context window size."""
        if self.model is not None and hasattr(self.model, 'config'):
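
Note on the `max_memory` budget computed in `_get_gpu_memory_map_with_limit` above: the per-GPU cap is the smaller of the fraction-based limit and the currently free VRAM minus 512 MB of headroom. The block below is a minimal standalone sketch of that arithmetic only, assuming plain integer byte counts in place of live `torch.cuda` queries. `gpu_budget` and `GIB` are hypothetical names used for illustration, not part of this commit.

```python
# Sketch of the GPU budget rule: min(total * fraction, free - headroom).
# Assumption: inputs are byte counts supplied by the caller; the real code
# obtains them from torch.cuda.get_device_properties / mem_get_info.

HEADROOM_BYTES = 512 * 1024 * 1024  # same 512 MB headroom as in the diff
GIB = 1024 ** 3                     # hypothetical convenience constant


def gpu_budget(total_vram: int, free_vram: int, vram_fraction: float,
               headroom: int = HEADROOM_BYTES) -> int:
    """Return the byte budget for one GPU."""
    limit_by_fraction = int(total_vram * vram_fraction)
    # Never go negative when free VRAM is smaller than the headroom.
    limit_by_free = max(0, free_vram - headroom)
    return min(limit_by_fraction, limit_by_free)


# A 24 GiB card with only 10 GiB free: the free-VRAM cap wins.
busy_card = gpu_budget(24 * GIB, 10 * GIB, 0.9)

# An idle 8 GiB card limited to 50%: the fraction cap wins.
idle_card = gpu_budget(8 * GIB, 8 * GIB, 0.5)

# Almost no free VRAM: the budget collapses to 0, so every layer would be
# placed on the 'cpu' entry of max_memory instead.
full_card = gpu_budget(8 * GIB, 256 * 1024 * 1024, 0.9)
```

The design point is that the fraction alone can over-commit when another process or a coexisting model already holds VRAM; taking the minimum with the free-memory cap avoids requesting more than is available.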

--- a/codai/backends/vulkan.py
+++ b/codai/backends/vulkan.py
@@ -38,6 +38,34 @@ try:
 except (ImportError, AttributeError):
    _grammar_guided_gen = False

+
+def _make_llama_thermal_criteria():
+    """Return a llama-cpp-python StoppingCriteriaList that pauses generation while too hot.
+
+    llama-cpp-python evaluates stopping criteria synchronously per token inside
+    create_(chat_)completion, so blocking here pauses the GPU forward pass —
+    mid-generation thermal protection for the GGUF/Vulkan/llama.cpp backend.
+    The criterion never stops generation (returns False) and is throttled so it
+    doesn't read sensors on every token. Returns None if unavailable.
+    """
+    try:
+        from llama_cpp import StoppingCriteriaList
+    except Exception:
+        return None
+
+    def _pause(input_ids, logits):
+        try:
+            from codai.models.thermal import checkpoint
+            checkpoint(context="text-gen", throttle_seconds=2.0)
+        except Exception:
+            pass
+        return False
+
+    try:
+        return StoppingCriteriaList([_pause])
+    except Exception:
+        return None
+
 try:
    from llama_cpp import Llama
    from llama_cpp.llama_chat_format import ChatFormatterResponse
@@ -699,6 +727,7 @@ class VulkanBackend(ModelBackend):
        
        try:
            result = self.model.create_completion(
+                stopping_criteria=_make_llama_thermal_criteria(),
                prompt=prompt,
                max_tokens=max_tokens,
                temperature=temperature,
@@ -717,6 +746,7 @@ class VulkanBackend(ModelBackend):
                print(f"Warning: Grammar-guided generation failed: {e}, falling back to normal generation")
                try:
                    result = self.model.create_completion(
+                        stopping_criteria=_make_llama_thermal_criteria(),
                        prompt=prompt,
                        max_tokens=max_tokens,
                        temperature=temperature,
@@ -803,6 +833,7 @@ class VulkanBackend(ModelBackend):
            prompt_len = len(prompt) if isinstance(prompt, str) else 0
            
            for chunk in self.model.create_completion(
+                stopping_criteria=_make_llama_thermal_criteria(),
                prompt=prompt,
                max_tokens=max_tokens,
                temperature=temperature,
@@ -842,6 +873,7 @@ class VulkanBackend(ModelBackend):
                    prompt_len = len(prompt) if isinstance(prompt, str) else 0
                    
                    for chunk in self.model.create_completion(
+                        stopping_criteria=_make_llama_thermal_criteria(),
                        prompt=prompt,
                        max_tokens=max_tokens,
                        temperature=temperature,
@@ -911,6 +943,7 @@ class VulkanBackend(ModelBackend):
                prompt_len = len(prompt)
                
                for chunk in self.model.create_completion(
+                    stopping_criteria=_make_llama_thermal_criteria(),
                    prompt=prompt,
                    max_tokens=max_tokens,
                    temperature=temperature,
@@ -937,6 +970,7 @@ class VulkanBackend(ModelBackend):
            return {"stream": generate_stream(), "content": ""}
        else:
            result = self.model.create_completion(
+                stopping_criteria=_make_llama_thermal_criteria(),
                prompt=prompt,
                max_tokens=max_tokens,
                temperature=temperature,
@@ -1052,6 +1086,9 @@ class VulkanBackend(ModelBackend):
            kwargs['stop'] = stop
        if response_format and response_format.get('type') == 'json_object':
            kwargs['response_format'] = {'type': 'json_object'}
+        _tc = _make_llama_thermal_criteria()
+        if _tc is not None:
+            kwargs['stopping_criteria'] = _tc

        result = self.model.create_chat_completion(**kwargs)
        usage = result.get('usage', {})
@@ -1077,6 +1114,9 @@ class VulkanBackend(ModelBackend):
        )
        if stop:
            kwargs['stop'] = stop
+        _tc = _make_llama_thermal_criteria()
+        if _tc is not None:
+            kwargs['stopping_criteria'] = _tc

        prompt_tokens = 0
        completion_tokens = 0
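
Note on the thermal stopping criterion used throughout this file: it blocks inside the per-token callback and always returns `False`, so it pauses generation without ever ending it. The body of `codai.models.thermal.checkpoint` is not shown in this part of the diff, so the block below is only a guess at the general shape of such a criterion, assuming a pause/resume hysteresis like the `*_high` / `*_resume` thresholds in `ThermalConfig`. `make_pause_criterion`, `read_temp`, `sleep` and `clock` are hypothetical names, injected so the sketch runs without sensors.

```python
import time
from typing import Callable


def make_pause_criterion(read_temp: Callable[[], float],
                         high: float, resume: float,
                         poll_seconds: float = 5.0,
                         throttle_seconds: float = 2.0,
                         sleep: Callable[[float], None] = time.sleep,
                         clock: Callable[[], float] = time.monotonic):
    """Build a per-token callback that blocks while the device is too hot."""
    last_check = [float("-inf")]  # mutable cell shared with the closure

    def _pause(input_ids, logits) -> bool:
        now = clock()
        # Throttle: read the sensor at most once per throttle_seconds.
        if now - last_check[0] >= throttle_seconds:
            last_check[0] = now
            if read_temp() >= high:
                # Hysteresis: stay paused until we are at or below `resume`.
                while read_temp() > resume:
                    sleep(poll_seconds)
        return False  # never request a stop; this criterion only delays

    return _pause


# Simulated run: 95 C triggers the pause, 92 C keeps it, 86 C releases it.
temps = iter([95.0, 92.0, 86.0])
sleeps = []
crit = make_pause_criterion(lambda: next(temps), high=90.0, resume=87.0,
                            sleep=sleeps.append, clock=lambda: 0.0)
first = crit(None, None)   # reads the sensor and waits one poll interval
second = crit(None, None)  # throttled: no sensor read at all
```

Because the callback runs synchronously between tokens, the blocking `sleep` is what holds back the next forward pass; returning `True` instead would truncate the response.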

--- a/codai/config.py
+++ b/codai/config.py
@@ -108,6 +108,24 @@ class ArchiveConfig:
    retention: str = "never"   # one of: 1h 1d 2d 1w 1m 3m 6m 1y never


+@dataclass
+class ThermalConfig:
+    """Thermal-protection configuration.
+
+    Before running a request against a loaded model, wait until CPU/GPU
+    temperatures are within safe limits so a long sequence of heavy
+    generations can't overheat the machine and trip its power-off protection.
+    Thresholds are in degrees Celsius. CPU and GPU can be toggled separately.
+    """
+    cpu_enabled: bool = True
+    gpu_enabled: bool = True
+    cpu_high: float = 90.0      # pause when CPU reaches this temperature
+    cpu_resume: float = 87.0    # resume once CPU drops back to/below this
+    gpu_high: float = 90.0      # pause when GPU reaches this temperature
+    gpu_resume: float = 87.0    # resume once GPU drops back to/below this
+    poll_seconds: float = 5.0   # how often to re-check while cooling down
+
+
 @dataclass
 class Config:
    """Main configuration class."""
@@ -120,6 +138,7 @@ class Config:
    image: ImageConfig = field(default_factory=ImageConfig)
    whisper: WhisperConfig = field(default_factory=WhisperConfig)
    archive: ArchiveConfig = field(default_factory=ArchiveConfig)
+    thermal: ThermalConfig = field(default_factory=ThermalConfig)
    broker: BrokerConfig = field(default_factory=BrokerConfig)
    system_prompt: Optional[str] = None
    tools_closer_prompt: bool = False
@@ -273,6 +292,7 @@ class ConfigManager:
                image=ImageConfig(**config_data.get("image", {})),
                whisper=WhisperConfig(**config_data.get("whisper", {})),
                archive=ArchiveConfig(**config_data.get("archive", {})),
+                thermal=ThermalConfig(**config_data.get("thermal", {})),
                broker=BrokerConfig(**config_data.get("broker", {})),
                system_prompt=config_data.get("system_prompt"),
                tools_closer_prompt=config_data.get("tools_closer_prompt", False),
@@ -382,6 +402,15 @@ class ConfigManager:
                "directory": self.config.archive.directory,
                "retention": self.config.archive.retention,
            },
+            "thermal": {
+                "cpu_enabled": self.config.thermal.cpu_enabled,
+                "gpu_enabled": self.config.thermal.gpu_enabled,
+                "cpu_high": self.config.thermal.cpu_high,
+                "cpu_resume": self.config.thermal.cpu_resume,
+                "gpu_high": self.config.thermal.gpu_high,
+                "gpu_resume": self.config.thermal.gpu_resume,
+                "poll_seconds": self.config.thermal.poll_seconds,
+            },
            "broker": {
                "enabled": self.config.broker.enabled,
                "base_url": self.config.broker.base_url,
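
Note on how the new `thermal` section loads: `ThermalConfig(**config_data.get("thermal", {}))` means a config file may set any subset of the keys and the rest fall back to the dataclass defaults, while an unknown key raises `TypeError`. The block below is a minimal sketch of that behaviour using a copy of the dataclass from this diff. `load_thermal` is a hypothetical helper, and `dataclasses.asdict` is used here for brevity in place of the explicit per-field dict that the commit's serializer writes.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class ThermalConfig:
    # Field names and defaults copied from the diff above.
    cpu_enabled: bool = True
    gpu_enabled: bool = True
    cpu_high: float = 90.0
    cpu_resume: float = 87.0
    gpu_high: float = 90.0
    gpu_resume: float = 87.0
    poll_seconds: float = 5.0


def load_thermal(config_data: Dict[str, Any]) -> ThermalConfig:
    """Hypothetical helper mirroring the ConfigManager load line."""
    return ThermalConfig(**config_data.get("thermal", {}))


# A config file with no "thermal" section yields the defaults.
defaults = load_thermal({})

# A partial section overrides only the keys it names.
partial = load_thermal({"thermal": {"gpu_high": 85.0, "gpu_resume": 80.0}})

# Round trip: dumping and reloading gives an equal object.
reloaded = load_thermal({"thermal": asdict(partial)})

# A misspelled key is rejected by the dataclass constructor.
try:
    load_thermal({"thermal": {"gpu_max": 80.0}})
    unknown_key_rejected = False
except TypeError:
    unknown_key_rejected = True
```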

--- a/codai/models/hf_loading.py
+++ b/codai/models/hf_loading.py
+# CoderAI - OpenAI-compatible API server
+# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
+#
+# This program is free software: you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation, either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+
+"""Shared HuggingFace/transformers loading helper.
+
+Translates a coderai per-model configuration (the uniform models.json schema)
+into ``from_pretrained`` kwargs so that EVERY transformers-based loader
+(spatial, embedding, audio-gen, vision, …) honours the same quantization,
+offload, flash-attention and memory settings as the text/image/video loaders.
+
+The configuration is the single source of truth — nothing here reads CLI args.
+"""
+
+import os
+from typing import Any, Dict, Optional
+
+
+def _norm(cfg: Optional[Dict[str, Any]]) -> Dict[str, Any]:
+    """Return the per-model config dict, unwrapping a forwarded `_raw_cfg`."""
+    if not cfg:
+        return {}
+    raw = cfg.get('_raw_cfg') if isinstance(cfg, dict) else None
+    merged = dict(cfg)
+    if isinstance(raw, dict):
+        # Raw entry fills any key the translated kwargs didn't set.
+        for k, v in raw.items():
+            merged.setdefault(k, v)
+    return merged
+
+
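+def resolve_dtype(cfg: Optional[Dict[str, Any]], default: str = 'bf16'):
+    """Resolve torch dtype from the model's `precision` setting.
+
+    Matching is case-insensitive; unknown values fall back to `default`.
+    """
+    import torch
+    table = {
+        'bf16': torch.bfloat16,
+        'f16':  torch.float16,
+        'fp16': torch.float16,
+        'f32':  torch.float32,
+        'fp32': torch.float32,
+    }
+    precision = str(_norm(cfg).get('precision') or default).strip().lower()
+    # Honour the caller's default (e.g. 'fp16') instead of dropping to float32.
+    fallback = table.get(str(default).strip().lower(), torch.float32)
+    return table.get(precision, fallback)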
+
+
+def build_quantization_config(cfg: Optional[Dict[str, Any]]):
+    """Build a transformers BitsAndBytesConfig from the model config, or None."""
+    c = _norm(cfg)
+    load_in_4bit = bool(c.get('load_in_4bit', False))
+    load_in_8bit = bool(c.get('load_in_8bit', False))
+    if not (load_in_4bit or load_in_8bit):
+        return None
+    try:
+        import torch
+        from transformers import BitsAndBytesConfig
+        if load_in_4bit:
+            return BitsAndBytesConfig(
+                load_in_4bit=True,
+                bnb_4bit_quant_type='nf4',
+                bnb_4bit_compute_dtype=resolve_dtype(cfg),
+                bnb_4bit_use_double_quant=True,
+            )
+        return BitsAndBytesConfig(load_in_8bit=True)
+    except Exception as e:
+        print(f"  Quantization requested but unavailable: {e}")
+        return None
+
+
+def _is_gguf_value(v) -> bool:
+    """True if a component_quantization value points to a GGUF file (path/URL)."""
+    return isinstance(v, str) and v.strip().lower().endswith('.gguf')
+
+
+def _normalize_quant_mode(mode) -> Optional[str]:
+    """Normalize a quant mode string to '2bit' / '4bit' / '8bit' / None.
+
+    2-bit uses the quanto backend (optimum-quanto); 4/8-bit use bitsandbytes.
+    GGUF file values are handled separately (see build_gguf_pipeline_components).
+    """
+    if mode in (None, '', 'none', 'off', False):
+        return None
+    if _is_gguf_value(mode):
+        return None  # GGUF handled elsewhere
+    m = str(mode).lower().replace('-', '').replace('_', '').replace(' ', '')
+    if m in ('2bit', '2', 'int2', 'quanto2'):
+        return '2bit'
+    if m in ('4bit', '4', 'int4', 'nf4', 'bnb4'):
+        return '4bit'
+    if m in ('8bit', '8', 'int8', 'bnb8'):
+        return '8bit'
+    return None
+
+
+def _discover_components(model_name: str) -> Dict[str, Any]:
+    """Return {component_name: [library, class_name]} from the pipeline config."""
+    out: Dict[str, Any] = {}
+    try:
+        from diffusers import DiffusionPipeline
+        for name, spec in DiffusionPipeline.load_config(model_name).items():
+            if name.startswith('_'):
+                continue
+            if isinstance(spec, (list, tuple)) and spec:
+                out[name] = list(spec)
+    except Exception:
+        pass
+    return out
+
+
+def build_pipeline_quant_config(model_name: str, cfg: Optional[Dict[str, Any]], dtype):
+    """Build a diffusers PipelineQuantizationConfig from a per-model config.
+
+    Honours an optional per-component override map ``component_quantization``
+    (e.g. {"transformer": "4bit", "text_encoder": "8bit", "vae": "none"}).
+    Supported per-component modes:
+      - "4bit" / "8bit": bitsandbytes (default backend)
+      - "2bit": optimum-quanto (int2) — requires `pip install optimum-quanto`
+      - a "*.gguf" path/URL: handled by build_gguf_pipeline_components, NOT here
+    When the map is absent it falls back to the global ``load_in_4bit`` /
+    ``load_in_8bit`` flag applied to all heavy components.
+
+    Returns ``(quant_config, description)`` or ``(None, '')``.
+    """
+    c = _norm(cfg)
+    comp_q = c.get('component_quantization') or {}
+    global_4 = bool(c.get('load_in_4bit', False))
+    global_8 = bool(c.get('load_in_8bit', False))
+    if not comp_q and not (global_4 or global_8):
+        return None, ''
+
+    try:
+        from diffusers.quantizers import PipelineQuantizationConfig
+        from diffusers import BitsAndBytesConfig as DiffBnb
+        from transformers import BitsAndBytesConfig as TfBnb
+    except Exception as e:
+        print(f"  Pipeline quantization unavailable: {e}")
+        return None, ''
+
+    comp_lib = {n: (s[0] if isinstance(s, list) and s else 'diffusers')
+                for n, s in _discover_components(model_name).items()}
+
+    def _is_heavy(name: str) -> bool:
+        return (name.startswith('transformer') or name == 'unet'
+                or name.startswith('text_encoder'))
+
+    def _quanto_cfg(lib: str):
+        # optimum-quanto int2 via the diffusers / transformers QuantoConfig.
+        import importlib.util
+        _have_quanto = False
+        try:
+            _have_quanto = importlib.util.find_spec('optimum.quanto') is not None
+        except Exception:
+            _have_quanto = False
+        if not _have_quanto:
+            print("  2-bit requested but optimum-quanto is not installed — "
+                  "run `pip install optimum-quanto`. Skipping (component stays "
+                  "full precision).")
+            return None
+        try:
+            if lib == 'transformers':
+                from transformers import QuantoConfig as QC
+                return QC(weights='int2')
+            # diffusers' QuantoConfig names this argument `weights_dtype`.
+            from diffusers import QuantoConfig as QC
+            return QC(weights_dtype='int2')
+        except Exception as e:
+            print(f"  2-bit (quanto) unavailable: {e}")
+            return None
+
+    def _mk(lib: str, mode: str):
+        if mode == '2bit':
+            return _quanto_cfg(lib)
+        BnB = TfBnb if lib == 'transformers' else DiffBnb
+        if mode == '4bit':
+            return BnB(load_in_4bit=True, bnb_4bit_quant_type='nf4',
+                       bnb_4bit_compute_dtype=dtype, bnb_4bit_use_double_quant=True)
+        return BnB(load_in_8bit=True)
+
+    quant_mapping: Dict[str, Any] = {}
+    descs = []
+    if comp_q:
+        for name, raw_mode in comp_q.items():
+            mode = _normalize_quant_mode(raw_mode)  # GGUF/none → None here
+            if mode is None:
+                continue
+            cfg_obj = _mk(comp_lib.get(name, 'diffusers'), mode)
+            if cfg_obj is not None:
+                quant_mapping[name] = cfg_obj
+                descs.append(f"{name}:{mode}")
+    else:
+        mode = '4bit' if global_4 else '8bit'
+        targets = [n for n in comp_lib if _is_heavy(n)] or \
+            ['transformer', 'transformer_2', 'text_encoder', 'unet']
+        for name in targets:
+            cfg_obj = _mk(comp_lib.get(name, 'diffusers'), mode)
+            if cfg_obj is not None:
+                quant_mapping[name] = cfg_obj
+                descs.append(f"{name}:{mode}")
+
+    if not quant_mapping:
+        return None, ''
+    try:
+        return PipelineQuantizationConfig(quant_mapping=quant_mapping), ', '.join(descs)
+    except Exception as e:
+        print(f"  Pipeline quantization build failed: {e}")
+        return None, ''
+
+
+def build_gguf_pipeline_components(model_name: str, cfg: Optional[Dict[str, Any]], dtype):
+    """Load pipeline components from GGUF files (Q2_K..Q8_0 — incl. 5/6-bit).
+
+    For each ``component_quantization`` entry whose value is a ``*.gguf`` path or
+    URL, load that component via
+    ``<Class>.from_single_file(..., GGUFQuantizationConfig)`` so it can be passed
+    to the pipeline's ``from_pretrained`` as a pre-built component (e.g.
+    ``transformer=<model>``).  The bit-width (Q5_K, Q6_K, …) is embedded in the
+    GGUF file itself.
+
+    Returns ``(components_dict, description)``; empty dict when none configured.
+    Only diffusers components (transformer*/unet/vae) are supported here;
+    GGUF text encoders are uncommon and skipped with a note.
+    """
+    c = _norm(cfg)
+    comp_q = c.get('component_quantization') or {}
+    gguf_entries = {n: v for n, v in comp_q.items() if _is_gguf_value(v)}
+    if not gguf_entries:
+        return {}, ''
+
+    try:
+        import diffusers
+        from diffusers import GGUFQuantizationConfig
+    except Exception as e:
+        print(f"  GGUF components unavailable: {e}")
+        return {}, ''
+
+    specs = _discover_components(model_name)  # name -> [library, class_name]
+    components: Dict[str, Any] = {}
+    descs = []
+    for name, path in gguf_entries.items():
+        spec = specs.get(name)
+        if not spec or spec[0] != 'diffusers':
+            print(f"  GGUF skip '{name}': only diffusers components supported "
+                  f"(got {spec}).")
+            continue
+        cls_name = spec[1] if len(spec) > 1 else None
+        cls = getattr(diffusers, cls_name, None) if cls_name else None
+        if cls is None or not hasattr(cls, 'from_single_file'):
+            print(f"  GGUF skip '{name}': no loadable class {cls_name}.")
+            continue
+        try:
+            print(f"  Loading GGUF component '{name}' from {path}")
+            model = cls.from_single_file(
+                path.strip(),
+                quantization_config=GGUFQuantizationConfig(compute_dtype=dtype),
+                torch_dtype=dtype,
+            )
+            components[name] = model
+            descs.append(f"{name}:gguf")
+        except Exception as e:
+            print(f"  GGUF load failed for '{name}' ({path}): {e}")
+    return components, ', '.join(descs)
+
+
+def build_from_pretrained_kwargs(
+    cfg: Optional[Dict[str, Any]],
+    *,
+    default_precision: str = 'bf16',
+    enable_flash: bool = True,
+) -> Dict[str, Any]:
+    """Build common ``from_pretrained`` kwargs from a coderai model config.
+
+    Honours: load_in_4bit/8bit, flash_attention, offload_strategy, offload_dir,
+    max_gpu_percent, manual_ram_gb, no_ram, precision.
+    """
+    import torch
+    c = _norm(cfg)
+    kwargs: Dict[str, Any] = {
+        'trust_remote_code': True,
+        'low_cpu_mem_usage': True,
+        'torch_dtype': resolve_dtype(cfg, default_precision),
+    }
+
+    # Quantization (transformers BitsAndBytesConfig)
+    quant = build_quantization_config(cfg)
+    if quant is not None:
+        kwargs['quantization_config'] = quant
+        bits = 4 if c.get('load_in_4bit') else 8
+        print(f"  HF quantization: {bits}-bit (bitsandbytes)")
+
+    # Flash attention — honour any of the three flash flags (the sdcpp ones are
+    # no-ops for transformers models, so an enabled sdcpp flag still signals the
+    # user's intent to use Flash-Attention-2 here).
+    _flash = (c.get('flash_attention', c.get('flash_attn', False))
+              or c.get('sdcpp_flash_attn', False)
+              or c.get('sdcpp_diffusion_flash_attn', False))
+    if enable_flash and _flash:
+        try:
+            import flash_attn  # noqa: F401
+            kwargs['attn_implementation'] = 'flash_attention_2'
+            print("  Flash Attention 2 enabled")
+        except Exception:
+            print("  Flash Attention 2 requested but not installed — ignoring")
+
+    # Offload / device placement
+    no_ram = bool(c.get('no_ram', False))
+    offload_strategy = (c.get('offload_strategy') or 'auto')
+    max_gpu_percent = c.get('max_gpu_percent')
+
+    if not torch.cuda.is_available():
+        return kwargs  # CPU-only host; let transformers place on CPU
+
+    if no_ram or offload_strategy == 'none':
+        # Everything on GPU, no CPU spill.
+        kwargs['device_map'] = {'': 0}
+        return kwargs
+
+    # Build a max_memory map so large models split GPU → CPU RAM → disk.
+    try:
+        import psutil
+        free_vram, total_vram = torch.cuda.mem_get_info(0)
+        headroom = 512 * 1024 * 1024
+        if max_gpu_percent is not None:
+            gpu_budget = int(total_vram * max(0.0, min(1.0, float(max_gpu_percent) / 100.0)))
+            gpu_budget = min(gpu_budget, max(0, free_vram - headroom))
+        else:
+            gpu_budget = max(0, free_vram - headroom)
+
+        manual_ram_gb = c.get('manual_ram_gb')
+        if manual_ram_gb:
+            cpu_budget = int(float(manual_ram_gb) * 1e9)
+        else:
+            cpu_budget = max(0, psutil.virtual_memory().available - int(4e9))
+
+        kwargs['device_map'] = 'auto'
+        kwargs['max_memory'] = {0: gpu_budget, 'cpu': cpu_budget}
+
+        # Disk overflow when offloading is allowed.
+        offload_dir = c.get('offload_dir') or os.path.join(
+            os.path.expanduser('~'), '.cache', 'coderai', 'offload')
+        offload_dir = os.path.expanduser(offload_dir)
+        os.makedirs(offload_dir, exist_ok=True)
+        kwargs['offload_folder'] = offload_dir
+        kwargs['offload_buffers'] = True
+    except Exception as e:
+        print(f"  Could not build offload map ({e}); loading with device_map=auto")
+        kwargs['device_map'] = 'auto'
+
+    return kwargs
+
+
+def pipeline_device_kwargs(cfg: Optional[Dict[str, Any]]) -> Dict[str, Any]:
+    """Return kwargs for HF ``pipeline(...)`` honouring quantization/offload.
+
+    HF pipelines accept ``model_kwargs`` (passed to from_pretrained) and a
+    ``device_map``/``torch_dtype``.  We funnel the same config through.
+    """
+    base = build_from_pretrained_kwargs(cfg)
+    pk: Dict[str, Any] = {}
+    model_kwargs: Dict[str, Any] = {}
+    for k in ('quantization_config', 'attn_implementation', 'max_memory',
+              'offload_folder', 'offload_buffers', 'low_cpu_mem_usage',
+              'trust_remote_code'):
+        if k in base:
+            model_kwargs[k] = base[k]
+    if 'torch_dtype' in base:
+        pk['torch_dtype'] = base['torch_dtype']
+    if 'device_map' in base:
+        pk['device_map'] = base['device_map']
+    if model_kwargs:
+        pk['model_kwargs'] = model_kwargs
+    return pk
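
The `max_memory` arithmetic in `build_from_pretrained_kwargs` is easier to follow without torch and psutil in the way. The sketch below restates it as a pure function. The name `memory_budget` and its parameters are hypothetical; the constants (512 MiB of GPU headroom, 4 GB of RAM kept back) are copied from the helper above.

```python
from typing import Dict, Optional, Union

HEADROOM = 512 * 1024 * 1024  # bytes left free on the GPU, as in the helper


def memory_budget(free_vram: int, total_vram: int, available_ram: int,
                  max_gpu_percent: Optional[float] = None,
                  manual_ram_gb: Optional[float] = None
                  ) -> Dict[Union[int, str], int]:
    """Return a max_memory-style map {0: gpu_bytes, 'cpu': cpu_bytes}."""
    usable = max(0, free_vram - HEADROOM)
    if max_gpu_percent is not None:
        # Clamp the percentage to [0, 100], then cap by what is actually free.
        fraction = max(0.0, min(1.0, float(max_gpu_percent) / 100.0))
        gpu_budget = min(int(total_vram * fraction), usable)
    else:
        gpu_budget = usable
    if manual_ram_gb:
        cpu_budget = int(float(manual_ram_gb) * 1e9)
    else:
        cpu_budget = max(0, available_ram - int(4e9))
    return {0: gpu_budget, 'cpu': cpu_budget}


GIB = 1024 ** 3
budget = memory_budget(free_vram=20 * GIB, total_vram=24 * GIB,
                       available_ram=64 * 10**9, max_gpu_percent=50)
print(budget)
```

Note that `max_gpu_percent` is a ceiling on total VRAM, not a guarantee: when less memory is free than the percentage allows, the free amount minus headroom wins.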
--- a/codai/platform_paths.py
+++ b/codai/platform_paths.py
@@ -112,6 +112,10 @@ def default_environments_dir() -> Path:
    return ensure_dir(legacy_style_config_dir() / "environments")


+def default_loras_dir() -> Path:
+    return ensure_dir(legacy_style_config_dir() / "loras")
+
+
 def default_whisper_server_path() -> str:
    if os.name == "nt":
        local = _windows_dir("LOCALAPPDATA", _home_dir() / "AppData" / "Local")

--- a/codai/pydantic/videorequest.py
+++ b/codai/pydantic/videorequest.py
@@ -20,6 +20,14 @@ from typing import Dict, List, Optional
 from pydantic import BaseModel, ConfigDict


+class VideoLoraConfig(BaseModel):
+    """A LoRA adapter to apply to the video diffusion pipeline for one request."""
+    model: str                          # path or HF id of the LoRA weights
+    weight: float = 1.0
+    name: Optional[str] = None
+    model_config = ConfigDict(extra="allow")
+
+
 class CharacterDialogLine(BaseModel):
    """One spoken line in a multi-character dialog sequence."""
    character: Optional[str] = None    # character profile name (used for lip-sync face)
@@ -78,6 +86,10 @@ class VideoGenerationRequest(BaseModel):
    # Named saved profiles to load (resolved server-side)
    character_profiles: Optional[List[str]] = None

+    # Per-request LoRA adapters (e.g. trained per-character identity LoRAs).
+    # Applied to diffusers video pipelines that support load_lora_weights.
+    loras: Optional[List[VideoLoraConfig]] = None
+
    # ── Audio generation / manipulation ──────────────────────────────────
    add_audio: Optional[bool] = False
    audio_type: Optional[str] = None        # music | speech | sfx | ambient
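
A request that uses the new `loras` field might look like the sketch below. The `prompt` key, the file path, and the Hub id are assumptions made for illustration; only `loras`, `character_profiles`, and the `model` / `weight` / `name` keys come from the models in this diff. `normalise_loras` is a hypothetical client-side helper that mirrors the `VideoLoraConfig` defaults, not server code.

```python
import json

request_body = {
    "prompt": "two fighters circle each other in a dusty yard",  # assumed field
    "character_profiles": ["khumalo"],
    "loras": [
        # Hypothetical local path to a per-character identity LoRA.
        {"model": "/path/to/khumalo_identity.safetensors",
         "weight": 0.8, "name": "khumalo"},
        # Hypothetical Hub id; weight and name are left to their defaults.
        {"model": "some-user/some-style-lora"},
    ],
}


def normalise_loras(body: dict) -> list:
    """Fill in the VideoLoraConfig defaults (weight=1.0, name=None)."""
    out = []
    for entry in body.get("loras") or []:
        if "model" not in entry:
            raise ValueError("each LoRA entry needs a 'model'")
        out.append({"weight": 1.0, "name": None, **entry})
    return out


loras = normalise_loras(request_body)
print(json.dumps(loras, indent=2))
```

Because `VideoLoraConfig` sets `extra="allow"`, additional keys on an entry would be accepted rather than rejected.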

--- a/tools/review_outputs.py
+++ b/tools/review_outputs.py
+#!/usr/bin/env python3
+"""
+Township Fighters — Output Review & LoRA Training UI
+
+A lightweight web UI for reviewing generated characters, environments, and
+videos, collecting good/bad ratings, and exporting approved images as a LoRA
+training dataset.
+
+Usage:
+    python tools/review_outputs.py [--out-dir ./township_output] [--port 7860]
+
+Then open http://localhost:7860 in your browser.
+
+LoRA export creates:
+    <out-dir>/lora_dataset/
+        images/          ← approved images
+        metadata.jsonl   ← caption per image (for dreambooth-style training)
+        train_lora.sh    ← ready-to-run training command
+
+Requirements:
+    pip install diffusers accelerate peft  (for training)
+    pip install Pillow                      (for thumbnail generation, usually present)
+"""
+
+import argparse
+import base64
+import http.server
+import io
+import json
+import mimetypes
+import os
+import shutil
+import subprocess
+import sys
+import threading
+import time
+import urllib.parse
+from pathlib import Path
+from typing import Optional
+
+
+# ─────────────────────────────────────────────────────────────────────────────
+# Feedback persistence
+# ─────────────────────────────────────────────────────────────────────────────
+
+FEEDBACK_FILE = "feedback.json"
+
+
+def _feedback_path(out_dir: Path) -> Path:
+    return out_dir / FEEDBACK_FILE
+
+
+def load_feedback(out_dir: Path) -> dict:
+    p = _feedback_path(out_dir)
+    if p.exists():
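+        try:
+            data = json.loads(p.read_text())
+            # Guard against a file that parses but lacks the expected shape.
+            if isinstance(data, dict) and isinstance(data.get("items"), dict):
+                return data
+        except Exception:
+            pass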
+    return {"version": 1, "items": {}}
+
+
+def save_feedback(out_dir: Path, data: dict):
+    _feedback_path(out_dir).write_text(json.dumps(data, indent=2))
+
+
+def set_rating(out_dir: Path, rel_path: str, rating: str, note: str = ""):
+    data = load_feedback(out_dir)
+    data["items"][rel_path] = {
+        "rating": rating,   # "good" | "bad" | "skip"
+        "note": note,
+        "timestamp": int(time.time()),
+    }
+    save_feedback(out_dir, data)
+
+
+# ─────────────────────────────────────────────────────────────────────────────
+# Output discovery
+# ─────────────────────────────────────────────────────────────────────────────
+
+def discover_outputs(out_dir: Path) -> dict:
+    """Return structured inventory of everything in the output directory."""
+    inv = {"characters": {}, "environments": {}, "videos": []}
+
+    chars_dir = out_dir / "characters"
+    if chars_dir.exists():
+        for char in sorted(chars_dir.iterdir()):
+            if not char.is_dir():
+                continue
+            meta_file = char / "meta.json"
+            meta = {}
+            if meta_file.exists():
+                try:
+                    meta = json.loads(meta_file.read_text())
+                except Exception:
+                    pass
+            images = sorted(char.glob("ref_*.png")) + sorted(char.glob("ref_*.jpg"))
+            inv["characters"][char.name] = {
+                "meta": meta,
+                "images": [str(p.relative_to(out_dir)) for p in images],
+            }
+
+    envs_dir = out_dir / "environments"
+    if envs_dir.exists():
+        for env in sorted(envs_dir.iterdir()):
+            if not env.is_dir():
+                continue
+            meta_file = env / "meta.json"
+            meta = {}
+            if meta_file.exists():
+                try:
+                    meta = json.loads(meta_file.read_text())
+                except Exception:
+                    pass
+            images = sorted(env.glob("ref_*.png")) + sorted(env.glob("ref_*.jpg"))
+            inv["environments"][env.name] = {
+                "meta": meta,
+                "images": [str(p.relative_to(out_dir)) for p in images],
+            }
+
+    videos_dir = out_dir / "videos"
+    if videos_dir.exists():
+        clips = sorted(videos_dir.glob("*_clip*.mp4"))
+        finals = [p for p in sorted(videos_dir.glob("*.mp4")) if p not in clips]
+        inv["videos"] = [str(p.relative_to(out_dir)) for p in finals + clips]
+
+    return inv
+
+
+# ─────────────────────────────────────────────────────────────────────────────
+# LoRA training export
+# ─────────────────────────────────────────────────────────────────────────────
+
+def export_lora_dataset(out_dir: Path, base_model: Optional[str] = None,
+                        steps: int = 500, lr: str = "1e-4") -> dict:
+    """
+    Collect all "good"-rated images + their prompts and write a
+    dreambooth-compatible dataset under <out-dir>/lora_dataset/.
+    Returns a summary dict.
+    """
+    feedback = load_feedback(out_dir)
+    good = {k: v for k, v in feedback["items"].items()
+            if v.get("rating") == "good"}
+
+    lora_dir = out_dir / "lora_dataset"
+    imgs_dir = lora_dir / "images"
+    imgs_dir.mkdir(parents=True, exist_ok=True)
+
+    meta_lines = []
+    copied = 0
+    skipped = 0
+
+    for rel_path, fb in sorted(good.items()):
+        src = out_dir / rel_path
+        if not src.exists() or not rel_path.lower().endswith((".png", ".jpg", ".jpeg")):
+            skipped += 1
+            continue
+
+        # Caption priority: reviewer note, then the prompt/description from the
+        # item's meta.json, then a generic fallback built from the item name.
+        parts = Path(rel_path).parts   # e.g. ("characters", "khumalo", "ref_00.png")
+        caption = fb.get("note", "").strip()
+        if not caption and len(parts) >= 2:
+            category = parts[0]        # "characters" or "environments"
+            name = parts[1]
+            # Look for meta.json
+            meta_path = out_dir / category / name / "meta.json"
+            if meta_path.exists():
+                try:
+                    meta = json.loads(meta_path.read_text())
+                    caption = meta.get("prompt", "") or meta.get("description", "")
+                except Exception:
+                    pass
+            if not caption:
+                caption = f"{name.replace('_', ' ')}, African township fighter, cinematic"
+
+        dest_name = rel_path.replace("/", "_").replace("\\", "_")
+        dest = imgs_dir / dest_name
+        shutil.copy2(src, dest)
+        meta_lines.append(json.dumps({
+            "file_name": f"images/{dest_name}",
+            "text": caption,
+        }))
+        copied += 1
+
+    if not meta_lines:
+        return {"ok": False, "error": "No good-rated images found. Rate some images first."}
+
+    (lora_dir / "metadata.jsonl").write_text("\n".join(meta_lines) + "\n")
+
+    # Detect any existing LoRA to extend
+    existing_lora = _find_existing_lora(lora_dir)
+
+    # Write a ready-to-run training script
+    model = base_model or "stabilityai/stable-diffusion-xl-base-1.0"
+    _write_train_script(lora_dir, model, existing_lora, steps=steps, lr=lr)
+
+    result = {
+        "ok": True,
+        "dataset_dir": str(lora_dir),
+        "images": copied,
+        "skipped": skipped,
+        "train_script": str(lora_dir / "train_lora.sh"),
+        "metadata": str(lora_dir / "metadata.jsonl"),
+    }
+    if existing_lora:
+        result["extending"] = str(existing_lora)
+    return result
+
+
+def _find_existing_lora(lora_dir: Path) -> Optional[Path]:
+    """
+    Return the most recent LoRA weights to extend from, checking in order:
+    1. Latest checkpoint-NNNN/ subdirectory inside lora_weights/
+    2. Most recently modified .safetensors file inside lora_weights/
+    """
+    weights_dir = lora_dir / "lora_weights"
+    if not weights_dir.exists():
+        return None
+
+    # Prefer the highest-numbered checkpoint directory (resumable mid-training)
+    checkpoints = sorted(
+        [d for d in weights_dir.iterdir()
+         if d.is_dir() and d.name.startswith("checkpoint-")],
+        key=lambda d: int(d.name.split("-")[-1])
+    )
+    if checkpoints:
+        return checkpoints[-1]
+
+    # Fall back to the most recently modified .safetensors file
+    safetensors = sorted(
+        weights_dir.glob("*.safetensors"),
+        key=lambda p: p.stat().st_mtime,
+        reverse=True,
+    )
+    return safetensors[0] if safetensors else None
+
+
+def _write_train_script(lora_dir: Path, base_model: str,
+                        existing_lora: Optional[Path] = None,
+                        steps: int = 500, lr: str = "1e-4"):
+    """
+    Write train_lora.sh.
+    - If existing_lora is a checkpoint-NNNN/ dir  → resume via --resume_from_checkpoint
+    - If existing_lora is a .safetensors file      → initialize LoRA from it, continue training
+    - If None                                       → fresh LoRA from scratch
+    """
+    weights_dir = lora_dir / "lora_weights"
+
+    if existing_lora and existing_lora.is_dir():
+        # Mid-training checkpoint: trainer can resume exactly
+        resume_flag = f'--resume_from_checkpoint="{existing_lora}"'
+        extend_note = f"# Resuming from checkpoint: {existing_lora}"
+        init_flag = ""
+    elif existing_lora and existing_lora.suffix == ".safetensors":
+        # Completed LoRA: load its adapter weights as the starting point.
+        # NOTE: this assumes the training script accepts
+        # --lora_model_name_or_path; check the trainer version in use.
+        resume_flag = ""
+        init_flag = f'--lora_model_name_or_path="{existing_lora}" \\'
+        extend_note = f"# Extending existing LoRA: {existing_lora}"
+    else:
+        resume_flag = ""
+        init_flag = ""
+        extend_note = "# Fresh LoRA training from scratch"
+
+    resume_line = f"  {resume_flag} \\\n" if resume_flag else ""
+    init_line   = f"  {init_flag}\n" if init_flag else ""
+
+    train_sh = f"""#!/bin/bash
+# LoRA training script — generated by review_outputs.py
+# Requires: pip install diffusers accelerate peft transformers
+{extend_note}
+
+DATASET_DIR="{lora_dir}"
+OUTPUT_DIR="{weights_dir}"
+BASE_MODEL="{base_model}"
+
+mkdir -p "$OUTPUT_DIR"
+
+accelerate launch --mixed_precision="fp16" \\
+  -m diffusers.scripts.train_dreambooth_lora_sdxl \\
+  --pretrained_model_name_or_path="$BASE_MODEL" \\
+  --dataset_name="$DATASET_DIR" \\
+  --output_dir="$OUTPUT_DIR" \\
+  --mixed_precision="fp16" \\
+  --resolution=1024 \\
+  --train_batch_size=1 \\
+  --gradient_accumulation_steps=4 \\
+  --learning_rate={lr} \\
+  --lr_scheduler="constant" \\
+  --lr_warmup_steps=0 \\
+  --max_train_steps={steps} \\
+  --checkpointing_steps=100 \\
+  --seed=42 \\
+  --report_to="none" \\
+{resume_line}{init_line}
+echo ""
+echo "LoRA weights saved to: $OUTPUT_DIR"
+echo "To use: add the .safetensors file to your CoderAI model config as a LoRA."
+"""
+    script = lora_dir / "train_lora.sh"
+    script.write_text(train_sh)
+    script.chmod(0o755)
+
+
+def run_lora_training(out_dir: Path, steps: int = 500, lr: str = "1e-4") -> dict:
+    """Launch the generated train_lora.sh in the background."""
+    lora_dir = out_dir / "lora_dataset"
+    script = lora_dir / "train_lora.sh"
+    if not script.exists():
+        return {"ok": False, "error": "No training script found. Export dataset first."}
+    # Re-generate script with latest steps/lr in case user changed them
+    existing = _find_existing_lora(lora_dir)
+    meta = lora_dir / "metadata.jsonl"
+    if meta.exists():
+        # Read base model from the existing script if we have one
+        base_model = "stabilityai/stable-diffusion-xl-base-1.0"
+        try:
+            for line in script.read_text().splitlines():
+                if line.strip().startswith("BASE_MODEL="):
+                    base_model = line.split("=", 1)[1].strip().strip('"')
+                    break
+        except Exception:
+            pass
+        _write_train_script(lora_dir, base_model, existing, steps=steps, lr=lr)
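+    log_path = lora_dir / "training.log"
+    # The child inherits the descriptor, so the parent's handle can be closed
+    # as soon as the process has been spawned.
+    with open(log_path, "w") as log_file:
+        proc = subprocess.Popen(
+            ["bash", str(script)],
+            stdout=log_file,
+            stderr=subprocess.STDOUT,
+            cwd=str(out_dir),
+        )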
+    return {
+        "ok": True,
+        "pid": proc.pid,
+        "log": str(log_path),
+        "message": f"Training started (PID {proc.pid}). Watch {log_path}",
+    }
+
+
+# ─────────────────────────────────────────────────────────────────────────────
+# Embedded HTML/JS UI
+# ─────────────────────────────────────────────────────────────────────────────
+
+_HTML = r"""<!DOCTYPE html>
+<html lang="en">
+<head>
+<meta charset="utf-8">
+<title>Township Fighters — Review</title>
+<style>
+*{box-sizing:border-box;margin:0;padding:0}
+body{font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif;background:#111;color:#e0e0e0;min-height:100vh}
+header{background:#1a1a1a;border-bottom:1px solid #333;padding:.75rem 1.25rem;display:flex;align-items:center;gap:1rem;position:sticky;top:0;z-index:100}
+header h1{font-size:15px;font-weight:700;color:#fff}
+.tabs{display:flex;gap:.25rem}
+.tab{padding:.35rem .85rem;border-radius:5px;cursor:pointer;font-size:12px;font-weight:600;color:#999;background:transparent;border:1px solid transparent;transition:all .15s}
+.tab.active,.tab:hover{background:#2a2a2a;color:#fff;border-color:#444}
+.tab.active{background:#6366f1;border-color:#6366f1;color:#fff}
+.stats{margin-left:auto;font-size:11px;color:#666}
+.stats span{color:#888}
+main{padding:1.25rem;max-width:1400px;margin:0 auto}
+.section{display:none}.section.active{display:block}
+.grid{display:grid;grid-template-columns:repeat(auto-fill,minmax(180px,1fr));gap:1rem}
+.card{background:#1c1c1c;border:1px solid #2a2a2a;border-radius:8px;overflow:hidden;transition:border-color .15s}
+.card:hover{border-color:#444}
+.card.good{border-color:#22c55e}
+.card.bad{border-color:#ef4444}
+.card.skip{border-color:#f59e0b}
+.thumb{width:100%;aspect-ratio:1;object-fit:cover;display:block;cursor:pointer;background:#0a0a0a}
+.card-body{padding:.5rem}
+.card-name{font-size:11px;color:#aaa;white-space:nowrap;overflow:hidden;text-overflow:ellipsis;margin-bottom:.4rem}
+.card-actions{display:flex;gap:.3rem}
+.btn{flex:1;padding:.3rem;border-radius:4px;border:none;cursor:pointer;font-size:11px;font-weight:600;transition:opacity .1s}
+.btn:hover{opacity:.8}
+.btn-good{background:#16a34a;color:#fff}
+.btn-bad{background:#dc2626;color:#fff}
+.btn-skip{background:#d97706;color:#fff}
+.btn-clear{background:#2a2a2a;color:#aaa;font-size:10px}
+.note-input{width:100%;margin-top:.35rem;padding:.25rem .4rem;background:#111;border:1px solid #333;border-radius:3px;color:#ccc;font-size:10px;resize:none}
+.group-header{font-size:13px;font-weight:700;color:#ddd;margin:1.25rem 0 .6rem;padding-bottom:.3rem;border-bottom:1px solid #2a2a2a}
+.video-card{background:#1c1c1c;border:1px solid #2a2a2a;border-radius:8px;overflow:hidden;transition:border-color .15s}
+.video-card:hover{border-color:#444}
+.video-card.good{border-color:#22c55e}.video-card.bad{border-color:#ef4444}.video-card.skip{border-color:#f59e0b}
+.video-card video{width:100%;display:block;max-height:200px;background:#000}
+.video-body{padding:.5rem}
+.video-name{font-size:11px;color:#aaa;white-space:nowrap;overflow:hidden;text-overflow:ellipsis;margin-bottom:.4rem}
+.video-grid{display:grid;grid-template-columns:repeat(auto-fill,minmax(300px,1fr));gap:1rem}
+.train-panel{background:#1c1c1c;border:1px solid #2a2a2a;border-radius:8px;padding:1.25rem;max-width:680px}
+.train-panel h2{font-size:14px;font-weight:700;margin-bottom:.75rem}
+.train-row{display:flex;gap:.75rem;align-items:center;margin-bottom:.75rem}
+.train-label{font-size:12px;color:#aaa;width:130px;flex-shrink:0}
+.train-input{flex:1;padding:.4rem .6rem;background:#111;border:1px solid #333;border-radius:4px;color:#ddd;font-size:12px}
+.action-btn{padding:.55rem 1.2rem;background:#6366f1;color:#fff;border:none;border-radius:5px;cursor:pointer;font-size:13px;font-weight:600}
+.action-btn:hover{background:#4f46e5}
+.action-btn.danger{background:#dc2626}
+.result-box{background:#111;border:1px solid #333;border-radius:5px;padding:.75rem;font-size:12px;color:#aaa;margin-top:.75rem;white-space:pre-wrap;display:none}
+.result-box.visible{display:block}
+.counter-badge{display:inline-block;padding:.1rem .4rem;border-radius:3px;font-size:10px;font-weight:700;margin-left:.3rem}
+.good-badge{background:#16a34a22;color:#4ade80}
+.bad-badge{background:#dc262622;color:#f87171}
+.skip-badge{background:#d9770622;color:#fbbf24}
+.lightbox{display:none;position:fixed;inset:0;background:rgba(0,0,0,.9);z-index:9999;align-items:center;justify-content:center;cursor:zoom-out}
+.lightbox.visible{display:flex}
+.lightbox img{max-width:90vw;max-height:90vh;object-fit:contain;border-radius:6px}
+</style>
+</head>
+<body>
+<div id="lightbox" class="lightbox" onclick="closeLightbox()"><img id="lb-img" src=""></div>
+<header>
+  <h1>Township Fighters — Review</h1>
+  <div class="tabs">
+    <div class="tab active" onclick="switchTab('characters',this)">Characters <span id="t-chars" class="counter-badge good-badge"></span></div>
+    <div class="tab" onclick="switchTab('environments',this)">Environments <span id="t-envs" class="counter-badge good-badge"></span></div>
+    <div class="tab" onclick="switchTab('videos',this)">Videos <span id="t-vids" class="counter-badge good-badge"></span></div>
+    <div class="tab" onclick="switchTab('training',this)">Training</div>
+  </div>
+  <div class="stats"><span id="stat-good">0</span> good · <span id="stat-bad">0</span> bad · <span id="stat-skip">0</span> maybe</div>
+</header>
+<main>
+  <div id="sec-characters" class="section active"></div>
+  <div id="sec-environments" class="section"></div>
+  <div id="sec-videos" class="section"></div>
+  <div id="sec-training" class="section">
+    <div class="train-panel">
+      <h2>LoRA Training from Approved Images</h2>
+
+      <div id="extend-notice" style="display:none;background:#1a2a1a;border:1px solid #2d4a2d;border-radius:5px;padding:.6rem .8rem;margin-bottom:.8rem;font-size:12px;color:#86efac;line-height:1.5"></div>
+
+      <div class="train-row">
+        <span class="train-label">Base image model</span>
+        <input id="base-model" class="train-input" value="" placeholder="e.g. John6666/pornmaster-pro-pony-asianponyv3vae-sdxl">
+      </div>
+      <div class="train-row">
+        <span class="train-label">Extra steps</span>
+        <input id="extra-steps" class="train-input" type="number" value="500" min="50" max="5000"
+               title="Total new steps to run (added on top of any existing training)">
+      </div>
+      <div class="train-row">
+        <span class="train-label">Learning rate</span>
+        <input id="lr" class="train-input" value="1e-4" placeholder="1e-4">
+      </div>
+
+      <div style="display:flex;gap:.75rem;flex-wrap:wrap;margin-top:.25rem">
+        <button class="action-btn" onclick="exportDataset()">1. Export / refresh dataset</button>
+        <button class="action-btn" onclick="trainLora()">2. Train / extend LoRA</button>
+      </div>
+      <div id="train-result" class="result-box"></div>
+
+      <div style="margin-top:1rem;font-size:11px;color:#555;line-height:1.7">
+        <b style="color:#888">First run (fresh LoRA):</b><br>
+        &nbsp;1. Rate images 👍 Good in Characters / Environments tabs.<br>
+        &nbsp;2. Click <b>Export dataset</b> → copies approved images + prompts to <code>lora_dataset/</code>.<br>
+        &nbsp;3. Click <b>Train LoRA</b> → runs training in background, saves <code>lora_dataset/lora_weights/*.safetensors</code>.<br>
+        &nbsp;4. Add the <code>.safetensors</code> to CoderAI's image model LoRA settings.<br>
+        <br>
+        <b style="color:#888">Subsequent runs (extending an existing LoRA):</b><br>
+        &nbsp;1. Generate more content, rate new images as 👍 Good.<br>
+        &nbsp;2. Click <b>Export dataset</b> → adds new images alongside old ones.<br>
+        &nbsp;3. Click <b>Train LoRA</b> → auto-detects existing weights and continues from them.<br>
+        &nbsp;&nbsp;&nbsp;&nbsp;If a <code>checkpoint-NNNN/</code> dir exists → resumes exactly from that point.<br>
+        &nbsp;&nbsp;&nbsp;&nbsp;If only a <code>.safetensors</code> exists → initializes adapter from it, then trains.<br>
+        <br>
+        <b style="color:#888">Requirements:</b> <code>pip install peft accelerate</code>
+      </div>
+    </div>
+  </div>
+</main>
+
+<script>
+let _inv = {}, _fb = {};
+
+async function api(path, body) {
+  const opts = body
+    ? {method:'POST', headers:{'Content-Type':'application/json'}, body:JSON.stringify(body)}
+    : {method:'GET'};
+  const r = await fetch(path, opts);
+  return r.json();
+}
+
+async function init() {
+  const data = await api('/api/data');
+  _inv = data.inventory;
+  _fb  = data.feedback;
+  renderCharacters();
+  renderEnvironments();
+  renderVideos();
+  updateStats();
+}
+
+function switchTab(name, el) {
+  document.querySelectorAll('.tab').forEach(t => t.classList.remove('active'));
+  document.querySelectorAll('.section').forEach(s => s.classList.remove('active'));
+  el.classList.add('active');
+  document.getElementById('sec-' + name).classList.add('active');
+  if (name === 'training') checkExistingLora();
+}
+
+function rating(relPath) { return (_fb[relPath] || {}).rating || ''; }
+function note(relPath)   { return (_fb[relPath] || {}).note   || ''; }
+
+function updateStats() {
+  const vals = Object.values(_fb);
+  document.getElementById('stat-good').textContent = vals.filter(v=>v.rating==='good').length;
+  document.getElementById('stat-bad').textContent  = vals.filter(v=>v.rating==='bad').length;
+  document.getElementById('stat-skip').textContent = vals.filter(v=>v.rating==='skip').length;
+
+  // Update tab badges
+  const charGood = Object.keys(_fb).filter(k=>k.startsWith('characters/')&&_fb[k].rating==='good').length;
+  const envGood  = Object.keys(_fb).filter(k=>k.startsWith('environments/')&&_fb[k].rating==='good').length;
+  const vidGood  = Object.keys(_fb).filter(k=>k.startsWith('videos/')&&_fb[k].rating==='good').length;
+  document.getElementById('t-chars').textContent = charGood || '';
+  document.getElementById('t-envs').textContent  = envGood  || '';
+  document.getElementById('t-vids').textContent  = vidGood  || '';
+}
+
+async function rate(relPath, rating, noteEl) {
+  const n = noteEl ? noteEl.value : (note(relPath) || '');
+  _fb[relPath] = {rating, note: n, timestamp: Date.now()/1000|0};
+  await api('/api/rate', {path: relPath, rating, note: n});
+  // Update card border
+  document.querySelectorAll(`[data-path="${CSS.escape(relPath)}"]`).forEach(el => {
+    el.classList.remove('good','bad','skip');
+    if (rating) el.classList.add(rating);
+  });
+  updateStats();
+}
+
+async function clearRate(relPath) {
+  delete _fb[relPath];
+  await api('/api/rate', {path: relPath, rating: '', note: ''});
+  document.querySelectorAll(`[data-path="${CSS.escape(relPath)}"]`).forEach(el => {
+    el.classList.remove('good','bad','skip');
+  });
+  updateStats();
+}
+
+function imageCard(relPath) {
+  const r = rating(relPath);
+  const n = note(relPath);
+  const fname = relPath.split('/').pop();
+  const id = relPath.replace(/[^a-z0-9]/gi,'_');
+  return `
+  <div class="card ${r}" data-path="${relPath}">
+    <img class="thumb" src="/file/${relPath}" alt="${fname}"
+         title="${relPath}" onclick="openLightbox('/file/${relPath}')">
+    <div class="card-body">
+      <div class="card-name" title="${relPath}">${fname}</div>
+      <div class="card-actions">
+        <button class="btn btn-good"  onclick="rate('${relPath}','good', document.getElementById('n_${id}'))">👍</button>
+        <button class="btn btn-bad"   onclick="rate('${relPath}','bad',  document.getElementById('n_${id}'))">👎</button>
+        <button class="btn btn-skip"  onclick="rate('${relPath}','skip', document.getElementById('n_${id}'))">🤔</button>
+        <button class="btn btn-clear" onclick="clearRate('${relPath}')">✕</button>
+      </div>
+      <textarea id="n_${id}" class="note-input" rows="2"
+                placeholder="Optional note…"
+                onblur="if(_fb['${relPath}'])rate('${relPath}',_fb['${relPath}'].rating,this)"
+      >${n}</textarea>
+    </div>
+  </div>`;
+}
+
+function videoCard(relPath) {
+  const r = rating(relPath);
+  const n = note(relPath);
+  const fname = relPath.split('/').pop();
+  const id = relPath.replace(/[^a-z0-9]/gi,'_');
+  return `
+  <div class="video-card ${r}" data-path="${relPath}">
+    <video controls preload="metadata" src="/file/${relPath}"></video>
+    <div class="video-body">
+      <div class="video-name" title="${relPath}">${fname}</div>
+      <div class="card-actions">
+        <button class="btn btn-good"  onclick="rate('${relPath}','good', document.getElementById('n_${id}'))">👍 Good</button>
+        <button class="btn btn-bad"   onclick="rate('${relPath}','bad',  document.getElementById('n_${id}'))">👎 Bad</button>
+        <button class="btn btn-skip"  onclick="rate('${relPath}','skip', document.getElementById('n_${id}'))">🤔 Maybe</button>
+        <button class="btn btn-clear" onclick="clearRate('${relPath}')">✕</button>
+      </div>
+      <textarea id="n_${id}" class="note-input" rows="2"
+                placeholder="Optional note…"
+                onblur="if(_fb['${relPath}'])rate('${relPath}',_fb['${relPath}'].rating,this)"
+      >${n}</textarea>
+    </div>
+  </div>`;
+}
+
+function renderCharacters() {
+  let html = '';
+  for (const [name, data] of Object.entries(_inv.characters || {})) {
+    const meta = data.meta || {};
+    const desc = meta.description || '';
+    html += `<div class="group-header">${name}<span style="font-weight:400;color:#666;font-size:11px;margin-left:.5rem">${desc}</span></div>`;
+    html += '<div class="grid">';
+    for (const img of data.images) html += imageCard(img);
+    html += '</div>';
+  }
+  document.getElementById('sec-characters').innerHTML = html || '<div style="color:#555;padding:2rem">No characters found in output directory.</div>';
+}
+
+function renderEnvironments() {
+  let html = '';
+  for (const [name, data] of Object.entries(_inv.environments || {})) {
+    const meta = data.meta || {};
+    const desc = meta.description || '';
+    html += `<div class="group-header">${name}<span style="font-weight:400;color:#666;font-size:11px;margin-left:.5rem">${desc}</span></div>`;
+    html += '<div class="grid">';
+    for (const img of data.images) html += imageCard(img);
+    html += '</div>';
+  }
+  document.getElementById('sec-environments').innerHTML = html || '<div style="color:#555;padding:2rem">No environments found in output directory.</div>';
+}
+
+function renderVideos() {
+  let html = '<div class="video-grid">';
+  for (const v of (_inv.videos || [])) html += videoCard(v);
+  html += '</div>';
+  document.getElementById('sec-videos').innerHTML =
+    (_inv.videos || []).length ? html : '<div style="color:#555;padding:2rem">No videos found in output directory.</div>';
+}
+
+async function exportDataset() {
+  const baseModel = document.getElementById('base-model').value.trim();
+  const steps = parseInt(document.getElementById('extra-steps').value) || 500;
+  const lr = document.getElementById('lr').value.trim() || '1e-4';
+  const box = document.getElementById('train-result');
+  box.textContent = 'Exporting dataset…';
+  box.classList.add('visible');
+  const r = await api('/api/export', {base_model: baseModel, steps, lr});
+  if (r.ok) {
+    let msg = `✓ Dataset ready:\n  ${r.images} image(s) → ${r.dataset_dir}\n  Script: ${r.train_script}`;
+    if (r.extending) msg += `\n\n  ↪ Will extend existing LoRA:\n    ${r.extending}`;
+    box.textContent = msg;
+    // Update the notice banner
+    const notice = document.getElementById('extend-notice');
+    if (r.extending) {
+      notice.style.display = '';
+      notice.innerHTML = `↪ Extending existing LoRA: <code>${r.extending}</code><br>
+        New approved images will be added on top — previous learning is preserved.`;
+    } else {
+      notice.style.display = 'none';
+    }
+  } else {
+    box.textContent = '✗ ' + (r.error || 'Export failed');
+  }
+}
+
+async function trainLora() {
+  const steps = parseInt(document.getElementById('extra-steps').value) || 500;
+  const lr = document.getElementById('lr').value.trim() || '1e-4';
+  const box = document.getElementById('train-result');
+  box.textContent = 'Starting training…';
+  box.classList.add('visible');
+  const r = await api('/api/train', {steps, lr});
+  box.textContent = r.ok ? `✓ ${r.message}\n\nLog file: ${r.log}` : '✗ ' + (r.error || 'Training failed');
+}
+
+// Check for existing LoRA on tab switch to training
+async function checkExistingLora() {
+  const r = await api('/api/lora-status');
+  const notice = document.getElementById('extend-notice');
+  if (r.existing_lora) {
+    notice.style.display = '';
+    notice.innerHTML = `↪ Existing LoRA detected: <code>${r.existing_lora}</code><br>
+      Training will continue from these weights; previous learning is preserved.`;
+  }
+}
+
+function openLightbox(src) {
+  document.getElementById('lb-img').src = src;
+  document.getElementById('lightbox').classList.add('visible');
+}
+function closeLightbox() {
+  document.getElementById('lightbox').classList.remove('visible');
+}
+document.addEventListener('keydown', e => { if (e.key==='Escape') closeLightbox(); });
+
+init();
+</script>
+</body>
+</html>
+"""
+
+
+# ─────────────────────────────────────────────────────────────────────────────
+# HTTP server
+# ─────────────────────────────────────────────────────────────────────────────
+
+class ReviewHandler(http.server.BaseHTTPRequestHandler):
+    out_dir: Path = None
+
+    def log_message(self, fmt, *args):
+        pass   # suppress per-request access log
+
+    def _send(self, code: int, content_type: str, body: bytes):
+        self.send_response(code)
+        self.send_header("Content-Type", content_type)
+        self.send_header("Content-Length", str(len(body)))
+        self.end_headers()
+        self.wfile.write(body)
+
+    def _json(self, data: dict, code: int = 200):
+        body = json.dumps(data).encode()
+        self._send(code, "application/json", body)
+
+    def do_GET(self):
+        parsed = urllib.parse.urlparse(self.path)
+        path = parsed.path
+
+        if path == "/" or path == "":
+            self._send(200, "text/html; charset=utf-8", _HTML.encode())
+            return
+
+        if path == "/api/data":
+            inv = discover_outputs(self.out_dir)
+            fb = load_feedback(self.out_dir)
+            self._json({"inventory": inv, "feedback": fb["items"]})
+            return
+
+        if path == "/api/lora-status":
+            existing = _find_existing_lora(self.out_dir / "lora_dataset")
+            self._json({"existing_lora": str(existing) if existing else None})
+            return
+
+        if path.startswith("/file/"):
+            rel = urllib.parse.unquote(path[6:])
+            abs_path = self.out_dir / rel
+            # Security: resolve symlinks and ".." segments, then ensure the
+            # result is still inside out_dir before serving anything.
+            try:
+                abs_path = abs_path.resolve()
+                abs_path.relative_to(self.out_dir.resolve())
+            except Exception:  # ValueError from relative_to, or a resolve() failure
+                self._send(403, "text/plain", b"Forbidden")
+                return
+            # is_file() rather than exists(): read_bytes() below fails on a directory
+            if not abs_path.is_file():
+                self._send(404, "text/plain", b"Not found")
+                return
+            mime = mimetypes.guess_type(str(abs_path))[0] or "application/octet-stream"
+            self._send(200, mime, abs_path.read_bytes())
+            return
+
+        self._send(404, "text/plain", b"Not found")
+
+    def do_POST(self):
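+        length = int(self.headers.get("Content-Length", 0))
+        try:
+            body = json.loads(self.rfile.read(length) or b"{}")
+        except ValueError:  # malformed JSON (JSONDecodeError) or undecodable bytes
+            self._json({"ok": False, "error": "Invalid JSON body"}, 400)
+            return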
+        path = urllib.parse.urlparse(self.path).path
+
+        if path == "/api/rate":
+            rel = body.get("path", "")
+            rating = body.get("rating", "")
+            note = body.get("note", "")
+            if rating:
+                set_rating(self.out_dir, rel, rating, note)
+            else:
+                # Clear
+                data = load_feedback(self.out_dir)
+                data["items"].pop(rel, None)
+                save_feedback(self.out_dir, data)
+            self._json({"ok": True})
+            return
+
+        if path == "/api/export":
+            result = export_lora_dataset(
+                self.out_dir,
+                base_model=body.get("base_model") or None,
+                steps=int(body.get("steps") or 500),
+                lr=body.get("lr") or "1e-4",
+            )
+            self._json(result)
+            return
+
+        if path == "/api/train":
+            result = run_lora_training(
+                self.out_dir,
+                steps=int(body.get("steps") or 500),
+                lr=body.get("lr") or "1e-4",
+            )
+            self._json(result)
+            return
+
+        self._send(404, "text/plain", b"Not found")
+
+
+# ─────────────────────────────────────────────────────────────────────────────
+# Main
+# ─────────────────────────────────────────────────────────────────────────────
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Township Fighters — Output Review & LoRA Training UI",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+WHAT THIS TOOL DOES
+───────────────────
+  Serves a web UI (no external dependencies beyond Python stdlib) where you
+  review everything gen_township_fighters.py produced, rate each item, and
+  progressively improve generation quality through LoRA fine-tuning.
+
+  Characters tab   — grid of all fighter reference images, one group per
+                     fighter. Click any image to enlarge it. Rate each one:
+                       👍 Good  — include in training data
+                       👎 Bad   — exclude (wrong face, bad lighting, etc.)
+                       🤔 Maybe — revisit later
+                     Add an optional text note per image.
+
+  Environments tab — same for environment reference images.
+
+  Videos tab       — inline video player for every generated clip (short,
+                     long, and outcome videos). Rate to track which prompts
+                     and settings produced the best results.
+
+  Training tab     — build and launch LoRA training from approved images:
+                       Step 1: Export dataset  → copies all 👍-rated images +
+                               their generation prompts into lora_dataset/
+                               (dreambooth format with metadata.jsonl).
+                       Step 2: Train LoRA      → runs lora_dataset/train_lora.sh
+                               via accelerate in a background process; saves
+                               .safetensors weights to lora_dataset/lora_weights/.
+                     You can also tune Steps (how long to train) and Learning
+                     rate directly in the tab before clicking either button.
+
+FEEDBACK STORAGE
+────────────────
+  Ratings are saved instantly to <out-dir>/feedback.json — no submit button.
+  The file is plain JSON and can be committed to version control to track
+  which outputs were acceptable across multiple generation runs.
+
+OUTPUT DIRECTORY STRUCTURE
+──────────────────────────
+  <out-dir>/
+    characters/<name>/ref_NN.png   ← fighter reference images
+    environments/<name>/ref_NN.png ← location reference images
+    videos/match_*.mp4             ← fight clips
+    feedback.json                  ← your ratings (written by this tool)
+    lora_dataset/
+      images/                      ← approved images copied here
+      metadata.jsonl               ← one caption per image (generation prompt)
+      train_lora.sh                ← ready-to-run training command
+      lora_weights/
+        pytorch_lora_weights.safetensors  ← final LoRA weights
+        checkpoint-100/            ← mid-training checkpoints
+        checkpoint-200/
+
+LORA TRAINING — FIRST RUN (fresh LoRA from scratch)
+────────────────────────────────────────────────────
+  Requirements:  pip install peft accelerate
+  (diffusers and transformers are already installed by CoderAI)
+
+  1. Generate content with gen_township_fighters.py.
+  2. Open this UI and rate images:  👍 for ones with good likeness/style.
+     Aim for at least 10-20 good images per subject for decent results.
+  3. In the Training tab, set the base model to the image model you used
+     (e.g. John6666/pornmaster-pro-pony-asianponyv3vae-sdxl).
+  4. Click "Export dataset" — verifies the dataset and shows a summary.
+  5. Click "Train LoRA" — training runs in the background (~5-15 min on
+     an RTX 3090 for 500 steps). Watch lora_dataset/training.log.
+  6. When done, add the .safetensors file to CoderAI:
+       Admin → Models → configure your image model → LoRA path
+
+LORA TRAINING — SUBSEQUENT RUNS (extending an existing LoRA)
+─────────────────────────────────────────────────────────────
+  After generating more fighters or environments and rating the new images:
+
+  1. Click "Export dataset" again — new approved images are added to the
+     dataset alongside the old ones.
+  2. The tool auto-detects any existing weights:
+       • If lora_weights/checkpoint-NNN/ exists  → uses --resume_from_checkpoint
+         (full optimizer state restored; smoothest convergence).
+       • If only a .safetensors file exists       → uses --lora_model_name_or_path
+         (adapter weights loaded as starting point; optimizer restarts).
+  3. A green banner in the Training tab confirms what will be extended.
+  4. Click "Train LoRA" — runs the additional steps on top of what was
+     already learned. Use a lower learning rate (e.g. 5e-5) for refinement.
+  5. The updated .safetensors replaces the old one in lora_weights/.
+
+  Each round of generate → rate → train improves quality incrementally.
+  There is no limit to how many rounds you can do.
+
+TUNING TIPS
+───────────
+  Steps        500  = quick first pass; good enough to see improvement
+               1000 = more thorough; use for final production LoRA
+  Learning rate
+    1e-4  = default for a fresh LoRA or large dataset additions
+    5e-5  = gentler refinement when extending an existing LoRA
+    1e-5  = very fine correction; use only with a mature LoRA
+
+  If generated characters look too generic after training:
+    → Add more diverse good images (different angles, lighting)
+    → Lower the learning rate on the next extension run
+
+  If generated characters over-fit (all look the same):
+    → Fewer steps or lower learning rate on the next run
+    → Drop some near-duplicate images from the dataset
+
+EXAMPLES
+────────
+  # Open the UI for the default output directory:
+  python tools/review_outputs.py
+
+  # Custom output directory:
+  python tools/review_outputs.py --out-dir /data/township_project
+
+  # Different port (useful if 7860 is taken by another tool):
+  python tools/review_outputs.py --port 8888
+
+  # Headless server — don't auto-open the browser:
+  python tools/review_outputs.py --no-open --port 7860
+
+  # Full workflow in one session:
+  python tools/gen_township_fighters.py --out-dir ./fights          # generate
+  python tools/review_outputs.py --out-dir ./fights                 # review & train
+  # generate more (a trailing comment after "\" would break the line continuation):
+  python tools/gen_township_fighters.py --out-dir ./fights \\
+    --reuse-fighters --reuse-environments --skip-characters \\
+    --skip-environments
+  python tools/review_outputs.py --out-dir ./fights                 # extend LoRA
+""",
+    )
+    parser.add_argument("--out-dir", default="./township_output", metavar="DIR",
+                        help="Output directory from gen_township_fighters.py (default: ./township_output)")
+    parser.add_argument("--port", type=int, default=7860, metavar="PORT",
+                        help="Port to serve the UI on (default: 7860)")
+    parser.add_argument("--no-open", action="store_true",
+                        help="Do not automatically open the browser")
+    args = parser.parse_args()
+
+    out_dir = Path(args.out_dir).resolve()
+    if not out_dir.exists():
+        print(f"✗ Output directory not found: {out_dir}")
+        print("  Run gen_township_fighters.py first to generate content.")
+        sys.exit(1)
+
+    ReviewHandler.out_dir = out_dir
+
+    server = http.server.HTTPServer(("0.0.0.0", args.port), ReviewHandler)
+
+    inv = discover_outputs(out_dir)
+    n_chars = sum(len(v["images"]) for v in inv["characters"].values())
+    n_envs  = sum(len(v["images"]) for v in inv["environments"].values())
+    n_vids  = len(inv["videos"])
+    fb = load_feedback(out_dir)
+    n_rated = len(fb["items"])
+
+    print(f"""
+╔══════════════════════════════════════════════════════════╗
+║       Township Fighters — Output Review UI               ║
+╚══════════════════════════════════════════════════════════╝
+  Output dir : {out_dir}
+  Content    : {n_chars} character images · {n_envs} environment images · {n_vids} videos
+  Feedback   : {n_rated} item(s) already rated
+  URL        : http://localhost:{args.port}
+""")
+
+    if not args.no_open:
+        def _open():
+            time.sleep(0.4)
+            import webbrowser
+            webbrowser.open(f"http://localhost:{args.port}")
+        threading.Thread(target=_open, daemon=True).start()
+
+    try:
+        server.serve_forever()
+    except KeyboardInterrupt:
+        print("\n  Stopped.")
+
+
+if __name__ == "__main__":
+    main()
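
Review note on the `/file/` route in `ReviewHandler.do_GET` above: the handler resolves the requested path and rejects anything that falls outside `out_dir`. The sketch below isolates that containment check so it can be exercised on its own. The helper name `resolve_inside` is hypothetical and does not appear in the commit; it is a minimal illustration of the resolve-then-`relative_to` pattern, not the tool's actual code.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def resolve_inside(root: Path, rel: str) -> Path | None:
    """Return the resolved path for `rel` if it stays inside `root`, else None.

    Hypothetical helper: mirrors the check in the /file/ handler.
    """
    root = root.resolve()
    # Joining with an absolute `rel` discards `root`, so the containment
    # check below is what actually rejects such requests.
    candidate = (root / rel).resolve()
    try:
        candidate.relative_to(root)  # raises ValueError when outside root
    except ValueError:
        return None
    return candidate


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "characters").mkdir()
        (root / "characters" / "ref_01.png").write_bytes(b"png")
        # Path inside the root: resolved path is returned.
        print(resolve_inside(root, "characters/ref_01.png"))
        # Traversal attempt: rejected.
        print(resolve_inside(root, "../outside.txt"))
```

A caller would still need to confirm the result is a regular file before reading it, since a directory inside the root also passes the containment check.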
--- a/tools/video_dubber.py
+++ b/tools/video_dubber.py
+#!/usr/bin/env python3
+"""Dub a video/audio file through CoderAI API while preserving background audio.
+
+The script keeps orchestration, media slicing, timing, mixing, and muxing local.
+All AI work is delegated to CoderAI endpoints:
+  - /v1/audio/transcriptions for dialogue detection/transcription
+  - /v1/chat/completions for speaker assignment, translation, and metric fitting
+  - /v1/audio/voices for voice profiles
+  - /v1/audio/clone for cloned speech generation
+  - /v1/audio/convert for singing/performance voice conversion when requested
+  - /v1/audio/stems for optional dialogue/background separation
+
+External tools required locally: ffmpeg and ffprobe.
+Python dependency required: requests.
+"""
+
+from __future__ import annotations
+
+import argparse
+import base64
+import dataclasses
+import json
+import math
+import os
+import re
+import shutil
+import subprocess
+import sys
+import tempfile
+import textwrap
+import time
+import uuid
+from pathlib import Path
+from typing import Any, Iterable
+
+try:
+    import requests
+except ImportError as exc:  # pragma: no cover - user environment check
+    raise SystemExit("This script requires requests: pip install requests") from exc
+
+
+DEFAULT_BASE_URL = os.environ.get("CODERAI_BASE_URL", "http://127.0.0.1:8000")
+DEFAULT_API_KEY = os.environ.get("CODERAI_API_KEY")
+SERVICE_ENV_PREFIXES = {
+    "transcribe": "CODERAI_TRANSCRIBE",
+    "text": "CODERAI_TEXT",
+    "voice": "CODERAI_VOICE",
+    "convert": "CODERAI_CONVERT",
+    "stems": "CODERAI_STEMS",
+}
+AUDIO_EXTS = {".wav", ".mp3", ".m4a", ".aac", ".flac", ".ogg", ".opus", ".webm"}
+VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".avi", ".webm", ".m4v"}
+SRT_TIME_RE = re.compile(
+    r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2})[,.](?P<ms>\d{1,3})\s*-->\s*"
+    r"(?P<eh>\d{2}):(?P<em>\d{2}):(?P<es>\d{2})[,.](?P<ems>\d{1,3})"
+)
+
+
+@dataclasses.dataclass
+class Segment:
+    index: int
+    start: float
+    end: float
+    text: str
+    speaker: str = "speaker_01"
+    translated: str = ""
+    is_singing: bool = False
+    voice_name: str = ""
+    ref_audio: Path | None = None
+    generated_audio: Path | None = None
+
+    @property
+    def duration(self) -> float:
+        return max(0.05, self.end - self.start)
+
+
+class CoderAIClient:
+    def __init__(self, base_url: str, api_key: str | None = None, timeout: int = 7200):
+        self.base_url = base_url.rstrip("/")
+        self.timeout = timeout
+        self.session = requests.Session()
+        if api_key:
+            self.session.headers["Authorization"] = f"Bearer {api_key}"
+
+    def _post_json(self, path: str, body: dict[str, Any]) -> dict[str, Any]:
+        response = self.session.post(f"{self.base_url}{path}", json=body, timeout=self.timeout)
+        if not response.ok:
+            raise RuntimeError(f"POST {path} failed: {response.status_code} {response.text[:800]}")
+        return response.json()
+
+    def _post_multipart(self, path: str, data: dict[str, Any], files: dict[str, Any]) -> Any:
+        response = self.session.post(f"{self.base_url}{path}", data=data, files=files, timeout=self.timeout)
+        if not response.ok:
+            raise RuntimeError(f"POST {path} failed: {response.status_code} {response.text[:800]}")
+        return response
+
+    def list_models(self) -> list[dict[str, Any]]:
+        response = self.session.get(f"{self.base_url}/v1/models", timeout=60)
+        if not response.ok:
+            return []
+        return response.json().get("data", [])
+
+    def transcribe(self, audio_path: Path, model: str, language: str | None) -> list[Segment]:
+        with audio_path.open("rb") as handle:
+            response = self._post_multipart(
+                "/v1/audio/transcriptions",
+                {
+                    "model": model,
+                    "language": language or "",
+                    "response_format": "srt",
+                    "temperature": "0",
+                },
+                {"file": (audio_path.name, handle, "application/octet-stream")},
+            )
+        return parse_srt(response.text)
+
+    def chat_json(self, model: str, system: str, user: str, max_tokens: int = 4096) -> Any:
+        data = self._post_json(
+            "/v1/chat/completions",
+            {
+                "model": model,
+                "messages": [
+                    {"role": "system", "content": system},
+                    {"role": "user", "content": user},
+                ],
+                "temperature": 0.2,
+                "max_tokens": max_tokens,
+            },
+        )
+        content = (data.get("choices") or [{}])[0].get("message", {}).get("content", "")
+        return extract_json(content)
+
+    def create_voice(self, name: str, audio_path: Path, transcript: str, description: str) -> None:
+        with audio_path.open("rb") as handle:
+            response = self.session.post(
+                f"{self.base_url}/v1/audio/voices",
+                data={"name": name, "transcript": transcript, "description": description},
+                files={"audio": (audio_path.name, handle, "audio/wav")},
+                timeout=self.timeout,
+            )
+        if response.status_code == 400 and "already" in response.text.lower():
+            return
+        if not response.ok:
+            raise RuntimeError(f"Create voice {name} failed: {response.status_code} {response.text[:800]}")
+
+    def clone_voice(self, voice_name: str, text: str, speed: float, out_path: Path) -> None:
+        data = self._post_json(
+            "/v1/audio/clone",
+            {"voice_name": voice_name, "text": text, "speed": speed, "response_format": "b64_wav"},
+        )
+        item = (data.get("data") or [{}])[0]
+        write_api_audio_item(item, out_path, self.session)
+
+    def convert_voice(
+        self,
+        source_audio: Path,
+        voice_name: str | None,
+        out_path: Path,
+        target_voice: Path | None = None,
+        f0_condition: bool = True,
+        length_adjust: float = 1.0,
+    ) -> None:
+        body: dict[str, Any] = {
+            "source_audio": file_data_uri(source_audio, "audio/wav"),
+            "f0_condition": f0_condition,
+            "length_adjust": length_adjust,
+            "response_format": "b64_wav",
+        }
+        if target_voice is not None:
+            body["target_voice"] = file_data_uri(target_voice, "audio/wav")
+        elif voice_name:
+            body["voice_name"] = voice_name
+        else:
+            raise RuntimeError("Voice conversion requires voice_name or target_voice")
+        data = self._post_json(
+            "/v1/audio/convert",
+            body,
+        )
+        item = (data.get("data") or [{}])[0]
+        write_api_audio_item(item, out_path, self.session)
+
+    def separate_stems(self, audio_path: Path, workdir: Path, fallback: bool) -> tuple[Path, Path] | None:
+        data = self._post_json(
+            "/v1/audio/stems",
+            {
+                "audio": file_data_uri(audio_path, "audio/wav"),
+                "stem_mode": "vocals-instrumental",
+                "fallback_mode": fallback,
+                "response_format": "b64_wav",
+            },
+        )
+        vocals = None
+        instrumental = None
+        for item in data.get("data", []):
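+            # Sanitize the server-supplied stem name before using it in a local file path.
+            stem_name = sanitize_name(str(item.get("name") or uuid.uuid4().hex))
+            target = workdir / f"stem_{stem_name}.wav"
+            write_api_audio_item(item, target, self.session)
+            role = (item.get("role") or item.get("name") or "").lower()
+            # Test instrumental markers first so a name such as "no_vocals" is not taken for the vocal stem.
+            if "instrument" in role or "backing" in role or "no_vocal" in role:
+                instrumental = target
+            elif "vocal" in role:
+                vocals = target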
+        if vocals and instrumental:
+            return vocals, instrumental
+        return None
+
+
+@dataclasses.dataclass(frozen=True)
+class CoderAIClients:
+    default: CoderAIClient
+    transcribe: CoderAIClient
+    text: CoderAIClient
+    voice: CoderAIClient
+    convert: CoderAIClient
+    stems: CoderAIClient
+
+
+def run(cmd: list[str], *, timeout: int | None = None) -> None:
+    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, timeout=timeout)
+    if proc.returncode != 0:
+        rendered = " ".join(cmd)
+        detail = proc.stderr.strip() or proc.stdout.strip() or "command failed"
+        raise RuntimeError(f"{rendered}\n{detail[:2000]}")
+
+
+def require_binary(name: str) -> str:
+    path = shutil.which(name)
+    if not path:
+        raise SystemExit(f"Required binary not found: {name}")
+    return path
+
+
+def media_duration(path: Path) -> float:
+    proc = subprocess.run(
+        ["ffprobe", "-v", "error", "-show_entries", "format=duration", "-of", "json", str(path)],
+        stdout=subprocess.PIPE,
+        stderr=subprocess.PIPE,
+        text=True,
+    )
+    if proc.returncode != 0:
+        raise RuntimeError(proc.stderr.strip())
+    return float(json.loads(proc.stdout)["format"]["duration"])
+
+
+def is_video(path: Path) -> bool:
+    if path.suffix.lower() in VIDEO_EXTS:
+        return True
+    if path.suffix.lower() in AUDIO_EXTS:
+        return False
+    proc = subprocess.run(
+        ["ffprobe", "-v", "error", "-select_streams", "v:0", "-show_entries", "stream=codec_type", "-of", "csv=p=0", str(path)],
+        stdout=subprocess.PIPE,
+        stderr=subprocess.DEVNULL,
+        text=True,
+    )
+    return "video" in proc.stdout
+
+
+def extract_audio(input_path: Path, output_path: Path) -> None:
+    run(["ffmpeg", "-y", "-i", str(input_path), "-vn", "-ac", "2", "-ar", "44100", "-c:a", "pcm_s16le", str(output_path)])
+
+
+def slice_audio(input_path: Path, start: float, end: float, output_path: Path) -> None:
+    run([
+        "ffmpeg",
+        "-y",
+        "-ss",
+        f"{start:.3f}",
+        "-to",
+        f"{end:.3f}",
+        "-i",
+        str(input_path),
+        "-ac",
+        "1",
+        "-ar",
+        "22050",
+        "-c:a",
+        "pcm_s16le",
+        str(output_path),
+    ])
+
+
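+def adjust_audio_timing(input_path: Path, target_duration: float, output_path: Path, max_stretch: float) -> None:
+    source_duration = media_duration(input_path)
+    if source_duration <= 0:
+        raise RuntimeError(f"Invalid generated audio duration for {input_path}")
+    if target_duration <= 0:
+        raise RuntimeError(f"Invalid target duration {target_duration} for {input_path}")
+    # A stretch limit below 1.0 would invert the clamp bounds, so treat 1.0 as the minimum.
+    max_stretch = max(max_stretch, 1.0)
+    ratio = source_duration / target_duration
+    ratio = min(max(ratio, 1.0 / max_stretch), max_stretch)
+    filters = []
+    fitted_duration = source_duration
+    if abs(ratio - 1.0) > 0.03:
+        filters.append(atempo_chain(ratio))
+        fitted_duration = source_duration / ratio
+    # Pad with silence whenever the (possibly retimed) clip is still shorter than the slot.
+    if fitted_duration < target_duration:
+        filters.append(f"apad=whole_dur={target_duration:.3f}")
+    filter_arg = ",".join(filters) if filters else "anull"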
+    run([
+        "ffmpeg",
+        "-y",
+        "-i",
+        str(input_path),
+        "-af",
+        filter_arg,
+        "-t",
+        f"{target_duration:.3f}",
+        "-ac",
+        "2",
+        "-ar",
+        "44100",
+        "-c:a",
+        "pcm_s16le",
+        str(output_path),
+    ])
+
+
+def atempo_chain(ratio: float) -> str:
+    parts: list[str] = []
+    remaining = ratio
+    while remaining > 2.0:
+        parts.append("atempo=2.0")
+        remaining /= 2.0
+    while remaining < 0.5:
+        parts.append("atempo=0.5")
+        remaining /= 0.5
+    parts.append(f"atempo={remaining:.6f}")
+    return ",".join(parts)
+
+
+def build_dub_track(segments: list[Segment], duration: float, out_path: Path, workdir: Path) -> None:
+    silence = workdir / "silence.wav"
+    run([
+        "ffmpeg",
+        "-y",
+        "-f",
+        "lavfi",
+        "-i",
+        "anullsrc=channel_layout=stereo:sample_rate=44100",
+        "-t",
+        f"{duration:.3f}",
+        "-c:a",
+        "pcm_s16le",
+        str(silence),
+    ])
+    inputs = ["-i", str(silence)]
+    filter_parts = []
+    mix_inputs = ["[0:a]"]
+    input_index = 1
+    for segment in segments:
+        if not segment.generated_audio:
+            continue
+        inputs.extend(["-i", str(segment.generated_audio)])
+        delay_ms = max(0, int(round(segment.start * 1000)))
+        filter_parts.append(
+            f"[{input_index}:a]adelay={delay_ms}|{delay_ms},volume=1.0[d{input_index}]"
+        )
+        mix_inputs.append(f"[d{input_index}]")
+        input_index += 1
+    if len(mix_inputs) == 1:
+        run(["ffmpeg", "-y", "-i", str(silence), "-c:a", "pcm_s16le", str(out_path)])
+        return
+    filter_parts.append(f"{''.join(mix_inputs)}amix=inputs={len(mix_inputs)}:duration=longest:normalize=0[out]")
+    run(["ffmpeg", "-y", *inputs, "-filter_complex", ";".join(filter_parts), "-map", "[out]", "-t", f"{duration:.3f}", str(out_path)])
+
+
+def duck_background(original_audio: Path, segments: list[Segment], out_path: Path, workdir: Path, duck_db: float) -> None:
+    volume = 10 ** (duck_db / 20.0)
+    mask = workdir / "dialogue_mask.wav"
+    duration = media_duration(original_audio)
+    silence = workdir / "mask_silence.wav"
+    run([
+        "ffmpeg",
+        "-y",
+        "-f",
+        "lavfi",
+        "-i",
+        "anullsrc=channel_layout=mono:sample_rate=44100",
+        "-t",
+        f"{duration:.3f}",
+        "-c:a",
+        "pcm_s16le",
+        str(silence),
+    ])
+    tone_inputs = ["-i", str(silence)]
+    filter_parts = []
+    mix_inputs = ["[0:a]"]
+    for i, segment in enumerate(segments, 1):
+        tone = workdir / f"mask_{i:04d}.wav"
+        run([
+            "ffmpeg",
+            "-y",
+            "-f",
+            "lavfi",
+            "-i",
+            "aevalsrc=1:s=44100",
+            "-t",
+            f"{segment.duration:.3f}",
+            str(tone),
+        ])
+        tone_inputs.extend(["-i", str(tone)])
+        delay_ms = max(0, int(round(segment.start * 1000)))  # adelay rejects negative delays
+        filter_parts.append(f"[{i}:a]adelay={delay_ms}|{delay_ms}[m{i}]")
+        mix_inputs.append(f"[m{i}]")
+    filter_parts.append(f"{''.join(mix_inputs)}amix=inputs={len(mix_inputs)}:duration=longest:normalize=0,alimiter=limit=1[mask]")
+    run(["ffmpeg", "-y", *tone_inputs, "-filter_complex", ";".join(filter_parts), "-map", "[mask]", "-t", f"{duration:.3f}", str(mask)])
+    run([
+        "ffmpeg",
+        "-y",
+        "-i",
+        str(original_audio),
+        "-i",
+        str(mask),
+        "-filter_complex",
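+        # sidechaincompress accepts ratios from 1 to 20 only, so clamp the value derived from --duck-db.
+        f"[0:a][1:a]sidechaincompress=threshold=0.01:ratio={min(20.0, max(1.0, 1 / max(volume, 0.001))):.3f}:attack=20:release=250[out]",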
+        "-map",
+        "[out]",
+        "-c:a",
+        "pcm_s16le",
+        str(out_path),
+    ])
+
+
+def mix_audio(background: Path, dubbed: Path, out_path: Path, duration: float) -> None:
+    run([
+        "ffmpeg",
+        "-y",
+        "-i",
+        str(background),
+        "-i",
+        str(dubbed),
+        "-filter_complex",
+        "[0:a][1:a]amix=inputs=2:duration=longest:normalize=0,loudnorm=I=-16:TP=-1.5:LRA=11[out]",
+        "-map",
+        "[out]",
+        "-t",
+        f"{duration:.3f}",
+        "-c:a",
+        "aac",
+        "-b:a",
+        "192k",
+        str(out_path),
+    ])
+
+
+def mux_output(input_path: Path, final_audio: Path, output_path: Path, video_input: bool) -> None:
+    if video_input:
+        run([
+            "ffmpeg",
+            "-y",
+            "-i",
+            str(input_path),
+            "-i",
+            str(final_audio),
+            "-map",
+            "0:v:0",
+            "-map",
+            "1:a:0",
+            "-c:v",
+            "copy",
+            "-c:a",
+            "aac",
+            "-shortest",
+            str(output_path),
+        ])
+    else:
+        run(["ffmpeg", "-y", "-i", str(final_audio), "-c:a", "aac", str(output_path)])
+
+
+def parse_srt(text: str) -> list[Segment]:
+    blocks = re.split(r"\n\s*\n", text.strip())
+    segments: list[Segment] = []
+    for block in blocks:
+        lines = [line.strip("\ufeff ") for line in block.splitlines() if line.strip()]
+        if not lines:
+            continue
+        time_line_index = next((i for i, line in enumerate(lines) if "-->" in line), -1)
+        if time_line_index < 0:
+            continue
+        match = SRT_TIME_RE.search(lines[time_line_index])
+        if not match:
+            continue
+        body = " ".join(lines[time_line_index + 1 :]).strip()
+        if not body:
+            continue
+        index_text = lines[0] if time_line_index > 0 else str(len(segments) + 1)
+        try:
+            index = int(index_text)
+        except ValueError:
+            index = len(segments) + 1
+        segments.append(
+            Segment(
+                index=index,
+                start=srt_time_to_seconds(match.group("h"), match.group("m"), match.group("s"), match.group("ms")),
+                end=srt_time_to_seconds(match.group("eh"), match.group("em"), match.group("es"), match.group("ems")),
+                text=body,
+            )
+        )
+    return segments
+
+
+def srt_time_to_seconds(h: str, m: str, s: str, ms: str) -> float:
+    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms.ljust(3, "0")[:3]) / 1000.0
+
+
+def extract_json(text: str) -> Any:
+    cleaned = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL | re.IGNORECASE).strip()
+    if cleaned.startswith("```"):
+        cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)
+        cleaned = re.sub(r"\s*```$", "", cleaned)
+    try:
+        return json.loads(cleaned)
+    except json.JSONDecodeError:
+        start = min((p for p in [cleaned.find("{"), cleaned.find("[")] if p >= 0), default=-1)
+        end = max(cleaned.rfind("}"), cleaned.rfind("]"))
+        if start >= 0 and end > start:
+            try:
+                return json.loads(cleaned[start : end + 1])
+            except json.JSONDecodeError:
+                pass  # fall through to the descriptive error below
+        raise RuntimeError(f"CoderAI chat did not return JSON:\n{text[:1000]}")
+
+
+def file_data_uri(path: Path, mime: str) -> str:
+    return f"data:{mime};base64," + base64.b64encode(path.read_bytes()).decode("ascii")
+
+
+def write_api_audio_item(item: dict[str, Any], out_path: Path, session: requests.Session) -> None:
+    for key in ("b64_wav", "b64_mp3", "b64_audio", "audio"):
+        if item.get(key):
+            raw = item[key]
+            if isinstance(raw, str) and raw.startswith("data:"):
+                raw = raw.split(",", 1)[1]
+            out_path.write_bytes(base64.b64decode(raw))
+            return
+    if item.get("url"):
+        response = session.get(item["url"], timeout=7200)
+        response.raise_for_status()
+        out_path.write_bytes(response.content)
+        return
+    raise RuntimeError(f"No audio payload found in API response item; keys present: {sorted(item)}")
+
+
+def choose_default_model(models: list[dict[str, Any]], capability: str) -> str | None:
+    for model in models:
+        if capability in (model.get("capabilities") or []):
+            return model.get("id")
+    return None
+
+
+def env_default(service: str, field: str, fallback: str | None = None) -> str | None:
+    prefix = SERVICE_ENV_PREFIXES[service]
+    return os.environ.get(f"{prefix}_{field}") or fallback
+
+
+def build_clients(args: argparse.Namespace) -> CoderAIClients:
+    default = CoderAIClient(args.base_url, args.api_key)
+
+    def service_client(service: str) -> CoderAIClient:
+        base_url = getattr(args, f"{service}_base_url") or args.base_url
+        api_key = getattr(args, f"{service}_api_key")
+        if api_key is None:
+            api_key = args.api_key
+        return CoderAIClient(base_url, api_key)
+
+    return CoderAIClients(
+        default=default,
+        transcribe=service_client("transcribe"),
+        text=service_client("text"),
+        voice=service_client("voice"),
+        convert=service_client("convert"),
+        stems=service_client("stems"),
+    )
+
+
+def client_label(client: CoderAIClient) -> str:
+    return client.base_url
+
+
+def assign_speakers(client: CoderAIClient, text_model: str, segments: list[Segment], max_speakers: int) -> None:
+    payload = [
+        {"id": s.index, "start": round(s.start, 3), "end": round(s.end, 3), "text": s.text}
+        for s in segments
+    ]
+    system = "You assign dialogue subtitle segments to recurring speakers. Return only JSON."
+    user = textwrap.dedent(
+        f"""
+        Assign each segment to one of at most {max_speakers} stable speaker ids.
+        Use ids like speaker_01, speaker_02. Mark singing=true when the segment appears sung, lyrical, chanted, or is likely part of music.
+        Return JSON as: {{"segments":[{{"id":1,"speaker":"speaker_01","singing":false}}]}}
+
+        Segments:
+        {json.dumps(payload, ensure_ascii=False)}
+        """
+    ).strip()
+    try:
+        data = client.chat_json(text_model, system, user, max_tokens=4096)
+        by_id = {int(item["id"]): item for item in data.get("segments", [])}
+        for segment in segments:
+            item = by_id.get(segment.index, {})
+            segment.speaker = sanitize_name(str(item.get("speaker") or segment.speaker))
+            segment.is_singing = bool(item.get("singing", False))
+    except Exception as exc:
+        print(f"warning: speaker assignment failed, using automatic speakers: {exc}", file=sys.stderr)
+        for i, segment in enumerate(segments):
+            segment.speaker = f"speaker_{(i % max(1, max_speakers)) + 1:02d}"
+
+
+def translate_segments(client: CoderAIClient, text_model: str, target_language: str, segments: list[Segment]) -> None:
+    batch_size = 40
+    system = "You translate dubbing scripts. Return only JSON."
+    for start in range(0, len(segments), batch_size):
+        batch = segments[start : start + batch_size]
+        payload = [
+            {
+                "id": s.index,
+                "source_text": s.text,
+                "duration_seconds": round(s.duration, 3),
+                "speaker": s.speaker,
+                "singing": s.is_singing,
+            }
+            for s in batch
+        ]
+        user = textwrap.dedent(
+            f"""
+            Translate each segment to {target_language} for dubbing.
+            Preserve meaning, tone, speaker intent, and song lyric style when singing=true.
+            Keep the translation speakable within the provided duration. Prefer natural lip-sync/metric fit over literal word order.
+            Return JSON as: {{"segments":[{{"id":1,"translation":"..."}}]}}
+
+            Segments:
+            {json.dumps(payload, ensure_ascii=False)}
+            """
+        ).strip()
+        data = client.chat_json(text_model, system, user, max_tokens=4096)
+        by_id = {int(item["id"]): str(item.get("translation", "")).strip() for item in data.get("segments", [])}
+        for segment in batch:
+            segment.translated = by_id.get(segment.index) or segment.text
+
+
+def fit_translation_metric(client: CoderAIClient, text_model: str, target_language: str, segments: list[Segment]) -> None:
+    system = "You adapt translated lines for dubbing timing and lip-sync. Return only JSON."
+    for segment in segments:
+        syllable_hint = max(2, int(segment.duration * 4.2))
+        user = textwrap.dedent(
+            f"""
+            Rewrite this {target_language} dub line so it fits about {segment.duration:.2f} seconds.
+            Aim for roughly {syllable_hint} syllables, preserve meaning, and keep it natural.
+            If singing is true, keep lyric rhythm and rhyme when possible.
+            Return JSON as: {{"translation":"..."}}
+
+            Original: {segment.text}
+            Current translation: {segment.translated}
+            Singing: {segment.is_singing}
+            """
+        ).strip()
+        try:
+            data = client.chat_json(text_model, system, user, max_tokens=512)
+            value = str(data.get("translation", "")).strip()
+            if value:
+                segment.translated = value
+        except Exception as exc:
+            print(f"warning: metric fitting failed for segment {segment.index}: {exc}", file=sys.stderr)
+
+
+def sanitize_name(value: str) -> str:
+    cleaned = re.sub(r"[^a-zA-Z0-9_-]+", "_", value.strip().lower())
+    return cleaned[:48] or "speaker_01"
+
+
+def create_voice_profiles(client: CoderAIClient, source_audio: Path, segments: list[Segment], workdir: Path, prefix: str) -> None:
+    by_speaker: dict[str, list[Segment]] = {}
+    for segment in segments:
+        by_speaker.setdefault(segment.speaker, []).append(segment)
+    for speaker, speaker_segments in by_speaker.items():
+        # Only the longest segment is used as the reference clip for the voice profile.
+        longest = max(speaker_segments, key=lambda s: s.duration)
+        start = max(0.0, longest.start - 0.05)
+        end = longest.end + 0.05
+        ref_audio = workdir / f"ref_{speaker}.wav"
+        slice_audio(source_audio, start, end, ref_audio)
+        transcript = longest.text.strip()
+        voice_name = sanitize_name(f"{prefix}_{speaker}")
+        print(f"creating voice profile {voice_name} from {start:.2f}-{end:.2f}s")
+        client.create_voice(voice_name, ref_audio, transcript, f"Auto-extracted by tools/video_dubber.py for {speaker}")
+        for segment in speaker_segments:
+            segment.voice_name = voice_name
+            segment.ref_audio = ref_audio
+
+
+def generate_segment_audio(
+    voice_client: CoderAIClient,
+    convert_client: CoderAIClient,
+    source_audio: Path,
+    segments: list[Segment],
+    workdir: Path,
+    max_stretch: float,
+    preserve_singing: bool,
+) -> None:
+    for n, segment in enumerate(segments, 1):
+        raw = workdir / f"dub_raw_{segment.index:04d}.wav"
+        fitted = workdir / f"dub_fit_{segment.index:04d}.wav"
+        speed = 1.0
+        if segment.translated:
+            approx_chars_per_sec = len(segment.translated) / segment.duration
+            if approx_chars_per_sec > 18:
+                speed = min(1.35, approx_chars_per_sec / 16)
+        print(f"[{n}/{len(segments)}] generating {segment.voice_name} {segment.duration:.2f}s")
+        if preserve_singing and segment.is_singing:
+            source_slice = workdir / f"sing_source_{segment.index:04d}.wav"
+            slice_audio(source_audio, segment.start, segment.end, source_slice)
+            try:
+                convert_client.convert_voice(
+                    source_slice,
+                    segment.voice_name,
+                    raw,
+                    target_voice=segment.ref_audio,
+                    f0_condition=True,
+                    length_adjust=1.0,
+                )
+            except Exception as exc:
+                print(f"warning: singing conversion failed for segment {segment.index}, falling back to cloned TTS: {exc}", file=sys.stderr)
+                voice_client.clone_voice(segment.voice_name, segment.translated or segment.text, speed, raw)
+        else:
+            voice_client.clone_voice(segment.voice_name, segment.translated or segment.text, speed, raw)
+        adjust_audio_timing(raw, segment.duration, fitted, max_stretch)
+        segment.generated_audio = fitted
+
+
+def write_artifacts(segments: list[Segment], output_base: Path) -> None:
+    json_path = output_base.with_suffix(".segments.json")
+    srt_path = output_base.with_suffix(".translated.srt")
+    json_path.write_text(
+        json.dumps([dataclasses.asdict(s) | {"ref_audio": str(s.ref_audio or ""), "generated_audio": str(s.generated_audio or "")} for s in segments], indent=2, ensure_ascii=False),
+        encoding="utf-8",
+    )
+    lines = []
+    for i, segment in enumerate(segments, 1):
+        lines.append(str(i))
+        lines.append(f"{seconds_to_srt(segment.start)} --> {seconds_to_srt(segment.end)}")
+        lines.append(segment.translated or segment.text)
+        lines.append("")
+    srt_path.write_text("\n".join(lines), encoding="utf-8")
+
+
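+def seconds_to_srt(value: float) -> str:
+    # Round once to whole milliseconds so a fraction like 0.9996 carries into the seconds field
+    # instead of producing an invalid ",1000" millisecond component.
+    total_ms = int(round(max(0.0, value) * 1000))
+    total_seconds, ms = divmod(total_ms, 1000)
+    h, remainder = divmod(total_seconds, 3600)
+    m, s = divmod(remainder, 60)
+    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"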
+
+
+def parse_args(argv: Iterable[str]) -> argparse.Namespace:
+    parser = argparse.ArgumentParser(
+        description="Dub video/audio through CoderAI API while preserving music and effects.",
+        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
+    )
+    parser.add_argument("input", type=Path, help="Input video or audio file")
+    parser.add_argument("-o", "--output", type=Path, help="Output media path")
+    parser.add_argument("-l", "--target-language", required=True, help="Target dubbing language, e.g. Italian, Spanish, ja")
+    parser.add_argument("--source-language", help="Optional source language hint for transcription")
+    parser.add_argument("--base-url", default=DEFAULT_BASE_URL, help="CoderAI API base URL")
+    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="CoderAI bearer token; defaults to CODERAI_API_KEY")
+    parser.add_argument("--audio-model", help="CoderAI transcription model id")
+    parser.add_argument("--text-model", default=env_default("text", "MODEL"), help="CoderAI text model id for translation and dialogue analysis")
+    parser.add_argument("--transcribe-base-url", default=env_default("transcribe", "BASE_URL"), help="Override CoderAI URL for /v1/audio/transcriptions")
+    parser.add_argument("--transcribe-api-key", default=env_default("transcribe", "API_KEY"), help="Override bearer token for transcription")
+    parser.add_argument("--transcribe-model", default=env_default("transcribe", "MODEL"), help="Alias for --audio-model; defaults to CODERAI_TRANSCRIBE_MODEL")
+    parser.add_argument("--text-base-url", default=env_default("text", "BASE_URL"), help="Override CoderAI URL for /v1/chat/completions")
+    parser.add_argument("--text-api-key", default=env_default("text", "API_KEY"), help="Override bearer token for text requests")
+    parser.add_argument("--voice-base-url", default=env_default("voice", "BASE_URL"), help="Override CoderAI URL for /v1/audio/voices and /v1/audio/clone")
+    parser.add_argument("--voice-api-key", default=env_default("voice", "API_KEY"), help="Override bearer token for voice cloning/profile requests")
+    parser.add_argument("--convert-base-url", default=env_default("convert", "BASE_URL"), help="Override CoderAI URL for /v1/audio/convert")
+    parser.add_argument("--convert-api-key", default=env_default("convert", "API_KEY"), help="Override bearer token for voice conversion")
+    parser.add_argument("--stems-base-url", default=env_default("stems", "BASE_URL"), help="Override CoderAI URL for /v1/audio/stems")
+    parser.add_argument("--stems-api-key", default=env_default("stems", "API_KEY"), help="Override bearer token for stem separation")
+    parser.add_argument("--max-speakers", type=int, default=8, help="Maximum recurring speaker voices to infer")
+    parser.add_argument("--voice-prefix", default="dub", help="Prefix for saved CoderAI voice profiles")
+    parser.add_argument("--no-stems", action="store_true", help="Do not call /v1/audio/stems; use local ducking to preserve background")
+    parser.add_argument("--stem-fallback", action="store_true", help="Ask CoderAI stems endpoint to use its ffmpeg fallback mode")
+    parser.add_argument("--no-metric-fit", action="store_true", help="Skip second LLM pass for tighter metric/lip-sync adaptation")
+    parser.add_argument("--no-singing-convert", action="store_true", help="Do not use /v1/audio/convert for singing segments")
+    parser.add_argument("--duck-db", type=float, default=-14.0, help="Dialogue-region background ducking target in dB when stems are disabled/unavailable")
+    parser.add_argument("--max-stretch", type=float, default=1.35, help="Maximum local time stretch/compress factor for generated lines")
+    parser.add_argument("--keep-workdir", type=Path, help="Keep intermediate files in this directory")
+    return parser.parse_args(list(argv))
+
+
+def main(argv: Iterable[str]) -> int:
+    args = parse_args(argv)
+    require_binary("ffmpeg")
+    require_binary("ffprobe")
+    input_path = args.input.expanduser().resolve()
+    if not input_path.exists():
+        raise SystemExit(f"Input file not found: {input_path}")
+
+    video_input = is_video(input_path)
+    output_path = args.output
+    if output_path is None:
+        suffix = ".mp4" if video_input else ".m4a"
+        output_path = input_path.with_name(f"{input_path.stem}.dubbed.{sanitize_name(args.target_language)}{suffix}")
+    output_path = output_path.expanduser().resolve()
+
+    clients = build_clients(args)
+    transcribe_models = clients.transcribe.list_models()
+    text_models = clients.text.list_models()
+    audio_model = args.audio_model or args.transcribe_model or choose_default_model(transcribe_models, "audio_transcription") or "whisper"
+    text_model = args.text_model or choose_default_model(text_models, "text_generation")
+    if not text_model:
+        raise SystemExit("No text model found. Pass --text-model with a CoderAI chat model id.")
+
+    work_context = tempfile.TemporaryDirectory(prefix="coderai-dub-") if args.keep_workdir is None else None
+    workdir = args.keep_workdir or Path(work_context.name)  # type: ignore[union-attr]
+    workdir.mkdir(parents=True, exist_ok=True)
+    try:
+        source_audio = workdir / "source.wav"
+        extract_audio(input_path, source_audio)
+        total_duration = media_duration(source_audio)
+
+        print(f"transcribing with {audio_model} via {client_label(clients.transcribe)}")
+        segments = clients.transcribe.transcribe(source_audio, audio_model, args.source_language)
+        segments = [s for s in segments if s.text.strip() and s.duration >= 0.08]
+        if not segments:
+            raise RuntimeError("No dialogue segments found in the input")
+
+        print(f"assigning speakers and singing flags with {text_model} via {client_label(clients.text)}")
+        assign_speakers(clients.text, text_model, segments, args.max_speakers)
+
+        print(f"translating {len(segments)} segments to {args.target_language} via {client_label(clients.text)}")
+        translate_segments(clients.text, text_model, args.target_language, segments)
+        if not args.no_metric_fit:
+            print("fitting translated lines to segment metrics")
+            fit_translation_metric(clients.text, text_model, args.target_language, segments)
+
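+        # Cap the file stem so sanitize_name's 48-character limit cannot cut off the timestamp here
+        # or the speaker suffix appended later in create_voice_profiles.
+        stem_part = sanitize_name(input_path.stem)[:16]
+        run_prefix = sanitize_name(f"{args.voice_prefix}_{stem_part}_{int(time.time())}")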
+        print(f"creating voice profiles via {client_label(clients.voice)}")
+        create_voice_profiles(clients.voice, source_audio, segments, workdir, run_prefix)
+        generate_segment_audio(
+            clients.voice,
+            clients.convert,
+            source_audio,
+            segments,
+            workdir,
+            args.max_stretch,
+            preserve_singing=not args.no_singing_convert,
+        )
+
+        dub_track = workdir / "dub_track.wav"
+        build_dub_track(segments, total_duration, dub_track, workdir)
+
+        background = workdir / "background.wav"
+        stems = None
+        if not args.no_stems:
+            print(f"requesting CoderAI stem separation via {client_label(clients.stems)}")
+            try:
+                stems = clients.stems.separate_stems(source_audio, workdir, args.stem_fallback)
+            except Exception as exc:
+                print(f"warning: stems unavailable, using local dialogue ducking: {exc}", file=sys.stderr)
+        if stems:
+            _, instrumental = stems
+            run(["ffmpeg", "-y", "-i", str(instrumental), "-t", f"{total_duration:.3f}", "-c:a", "pcm_s16le", str(background)])
+        else:
+            duck_background(source_audio, segments, background, workdir, args.duck_db)
+
+        final_audio = workdir / "final_audio.m4a"
+        mix_audio(background, dub_track, final_audio, total_duration)
+        mux_output(input_path, final_audio, output_path, video_input)
+        write_artifacts(segments, output_path)
+
+        print(f"wrote {output_path}")
+        print(f"wrote {output_path.with_suffix('.segments.json')}")
+        print(f"wrote {output_path.with_suffix('.translated.srt')}")
+        if args.keep_workdir:
+            print(f"kept workdir {workdir}")
+        return 0
+    finally:
+        if work_context is not None:
+            work_context.cleanup()
+
+
+if __name__ == "__main__":
+    try:
+        raise SystemExit(main(sys.argv[1:]))
+    except KeyboardInterrupt:
+        raise SystemExit(130)
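
Review note on the timing logic: the ratio clamp in `adjust_audio_timing` and the decomposition in `atempo_chain` can be checked without ffmpeg. The sketch below restates them as pure functions. `clamped_tempo_ratio` and `chain_product` are hypothetical helper names that do not appear in the commit, and the 0.5 to 2.0 per-stage window is taken from the loop bounds in the diff rather than from any ffmpeg guarantee.

```python
import math


def clamped_tempo_ratio(source_duration: float, target_duration: float, max_stretch: float) -> float:
    """Hypothetical helper: tempo factor mapping source onto target, limited by max_stretch."""
    if source_duration <= 0 or target_duration <= 0:
        raise ValueError("durations must be positive")
    max_stretch = max(max_stretch, 1.0)  # a limit below 1.0 would invert the bounds
    ratio = source_duration / target_duration
    return min(max(ratio, 1.0 / max_stretch), max_stretch)


def atempo_chain(ratio: float) -> str:
    """Same decomposition as the diff: split a tempo factor into stages within [0.5, 2.0]."""
    parts = []
    remaining = ratio
    while remaining > 2.0:
        parts.append("atempo=2.0")
        remaining /= 2.0
    while remaining < 0.5:
        parts.append("atempo=0.5")
        remaining /= 0.5
    parts.append(f"atempo={remaining:.6f}")
    return ",".join(parts)


def chain_product(chain: str) -> float:
    """Hypothetical check: multiply the stage factors back together."""
    return math.prod(float(stage.split("=", 1)[1]) for stage in chain.split(","))


if __name__ == "__main__":
    # A 3.0 s clip aimed at a 2.0 s slot wants a factor of 1.5, which the default limit caps.
    print(clamped_tempo_ratio(3.0, 2.0, max_stretch=1.35))  # 1.35
    print(atempo_chain(5.0))  # atempo=2.0,atempo=2.0,atempo=1.250000
    print(atempo_chain(0.2))  # atempo=0.5,atempo=0.5,atempo=0.800000
```

With the default `--max-stretch` of 1.35 the clamped ratio never leaves the 0.5 to 2.0 window, so the chain always has a single stage; the multi-stage branches only matter if a caller passes a limit above 2.0.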