Backends, API, and tooling updates; gitignore township_output

- cuda/vulkan backend improvements and config plumbing
- API updates across characters, text, environments, audio, embeddings, tts
- admin chat/settings template updates
- add hf_loading helper, video request fields, platform paths
- new docs (CODERAI_API_DOCUMENTATION.md) and tools (review_outputs, video_dubber)
- ignore generated township_output/
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 7dc60f66
@@ -26,3 +26,6 @@ test_*.py
# Local git worktrees
.worktrees/
# Generated township fighter outputs
township_output/
# AI.PROMPT: coderai architecture notes for AI assistants
This file records non-obvious architecture and invariants that AI assistants
(and humans) must respect when working on coderai. Read it before changing model
loading, configuration handling, or VRAM/eviction logic.
================================================================================
## Configuration is the single source of truth
================================================================================
Every model's behaviour comes from its entry in `~/.coderai/models.json`.
Loaders MUST read settings from the per-model configuration, NOT from CLI args
(`global_args`). CLI flags are not used to configure per-model behaviour.
### Uniform models.json schema
Every model entry — regardless of type — carries the same base fields:
load_in_4bit, load_in_8bit # quantization (bitsandbytes)
flash_attention # NOTE: key is 'flash_attention', NOT 'flash_attn'
offload_strategy # 'auto' | 'none' | 'cpu' | 'sequential' | 'model' | 'disk'
offload_dir # disk-offload directory
max_gpu_percent # cap GPU budget 0-100
manual_ram_gb # explicit CPU RAM budget
no_ram # GPU-only, no CPU spill
n_ctx # context size. NOTE: key is 'n_ctx', NOT 'context_size'
n_gpu_layers
used_vram_gb # raw (full-precision) VRAM estimate
precision # 'bf16' | 'f16' | 'f32'
### Key-name traps (caused silent config-ignored bugs)
- Use `flash_attention` (legacy code wrongly read `flash_attn`)
- Use `n_ctx` (legacy code wrongly read `context_size`)
Always read the canonical key first, with the legacy key only as a fallback.
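A minimal sketch of the canonical-first lookup described above. `read_setting` and `LEGACY_KEYS` are hypothetical names used only for illustration; they are not taken from the coderai source.

```python
from typing import Any, Mapping

# Canonical key -> legacy spelling that older code wrongly read.
LEGACY_KEYS = {"flash_attention": "flash_attn", "n_ctx": "context_size"}


def read_setting(cfg: Mapping[str, Any], key: str, default: Any = None) -> Any:
    """Return cfg[key]; fall back to the legacy spelling, then to default."""
    if key in cfg:
        return cfg[key]
    legacy = LEGACY_KEYS.get(key)
    if legacy is not None and legacy in cfg:
        return cfg[legacy]
    return default


if __name__ == "__main__":
    entry = {"flash_attention": False, "flash_attn": True, "context_size": 8192}
    print(read_setting(entry, "flash_attention"))  # canonical key wins
    print(read_setting(entry, "n_ctx"))            # legacy fallback
```

Checking `key in cfg` rather than truthiness matters here: an explicit `False` under the canonical key must not fall through to the legacy value.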
================================================================================
## Where config is translated and applied
================================================================================
### main.py :: build_kwargs_from_config(model_cfg, model_type)
Emits a COMMON BASE of kwargs (quant/offload/flash/memory/n_gpu_layers) for
EVERY type, plus type-specific extras, plus the original entry as `_raw_cfg`.
No model type may be left without the common base.
### codai/models/hf_loading.py (shared helper — transformers / HF pipeline)
Translates a model config into `from_pretrained` kwargs:
- `build_from_pretrained_kwargs(cfg)` → torch_dtype, transformers
`BitsAndBytesConfig` (4-bit nf4 / 8-bit), `device_map='auto'` + `max_memory`
(GPU→CPU split), `offload_folder` (disk overflow), `attn_implementation`.
- `build_quantization_config(cfg)` → transformers BitsAndBytesConfig or None.
- `pipeline_device_kwargs(cfg)` → kwargs for HF `pipeline(...)` (uses model_kwargs).
- `resolve_dtype(cfg, default)` → torch dtype from `precision`.
Used by: spatial (depth/segmentation), embedding, audio_gen (AudioLDM2).
### diffusers loaders (image, video, audioldm)
Two shared helpers in `codai/models/hf_loading.py`, both keyed off the per-model
`component_quantization` map (e.g. `{"transformer":"4bit","text_encoder":"8bit",
"vae":"none"}`):
- `build_pipeline_quant_config(model_name, cfg, dtype)` → diffusers
`PipelineQuantizationConfig` (`quant_mapping`) for in-place quantization.
Supported per-component modes:
* "4bit"/"8bit" → bitsandbytes (always available)
* "2bit" → optimum-quanto int2 (needs `pip install optimum-quanto`;
skipped with a warning if missing — bnb CANNOT do 2-bit)
* "none"/omitted → full precision
diffusers BnB for transformer/unet/vae; transformers BnB/Quanto for
text_encoder*. No override → global `load_in_4bit`/`8bit` on all heavy
components (MUST include UMT5 `text_encoder` for Wan2.2 or it won't fit GPU).
- `build_gguf_pipeline_components(model_name, cfg, dtype)` → for any
`component_quantization` value that is a `*.gguf` path/URL, loads that
component via `<Class>.from_single_file(..., GGUFQuantizationConfig)` and
returns it to be injected as a pipeline kwarg (e.g. `transformer=<model>`).
This is the ONLY way to get 5-bit/6-bit (Q5_K/Q6_K); they don't exist in
bnb/quanto. Needs a pre-quantized GGUF file + the `gguf` lib. diffusers
components only.
- Components discovered from `DiffusionPipeline.load_config(model_name)`.
- Configurable in models.json (`component_quantization`) and the Admin UI
(Per-component quantization section: Default/2-bit/4-bit/8-bit/GGUF-file/None
per component).
- bitsandbytes = 4/8-bit ONLY. 2-bit = quanto. 5/6-bit = GGUF files only.
IMPORTANT: bitsandbytes-quantized pipelines/models are placed on GPU during
`from_pretrained` and CANNOT be moved with `.to()` afterwards — load via
`device_map` and skip the explicit `.to(device)` when a quant config is active.
### text / vision
Go through `codai/backends/cuda.py :: NvidiaBackend.load_model`, which already
reads `flash_attn`, `load_in_4bit`, `load_in_8bit`, `offload_strategy`,
`max_gpu_percent`, `no_ram`, `offload_dir` from kwargs.
================================================================================
## Loading strategy (GPU-max, then spill)
================================================================================
The GPU should always be used as much as possible, then spill to CPU RAM, then
to disk. Never go straight to CPU unless explicitly configured. `no_ram=True`
or `offload_strategy='none'` means GPU-only; pure CPU is a last resort, used only
when no CUDA device exists or every GPU+RAM+disk attempt failed.
GPU budget is capped at `min(total × fraction, free_vram − 512MB headroom)` so
we never request more VRAM than is physically free; overflow goes to CPU/disk
via `device_map='auto'` + `max_memory` + `offload_folder`.
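The budget cap above can be sketched as follows. `gpu_budget_gb` and `build_max_memory` are hypothetical helper names, and the single-GPU `{0: ..., "cpu": ...}` layout is an assumption; only the `min(total × fraction, free − 512MB)` rule comes from these notes.

```python
from typing import Dict, Union

HEADROOM_GB = 0.5  # the 512 MB headroom described above


def gpu_budget_gb(total_gb: float, free_gb: float, max_gpu_percent: float = 100.0) -> float:
    """Cap the GPU budget at min(total * fraction, free - headroom), never below 0."""
    fraction = max(0.0, min(max_gpu_percent, 100.0)) / 100.0
    budget = min(total_gb * fraction, free_gb - HEADROOM_GB)
    return max(budget, 0.0)


def build_max_memory(budget_gb: float, cpu_ram_gb: float) -> Dict[Union[int, str], str]:
    """Build a max_memory mapping (GPU 0 plus CPU) using MiB size strings."""
    return {
        0: f"{int(budget_gb * 1024)}MiB",
        "cpu": f"{int(cpu_ram_gb * 1024)}MiB",
    }


if __name__ == "__main__":
    budget = gpu_budget_gb(total_gb=24.0, free_gb=20.0, max_gpu_percent=90.0)
    print(budget, build_max_memory(budget, cpu_ram_gb=32.0))
```

In this example the free-VRAM term wins (19.5 GB) even though 90% of the card would be 21.6 GB, which is the point of the cap: never request more than is physically free.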
bitsandbytes quantization requires CUDA; on CPU-only hosts it is skipped
gracefully. MusicGen (audiocraft `get_pretrained`) exposes no quant hook and is
the one loader left unquantized.
================================================================================
## VRAM estimate & eviction
================================================================================
### codai/models/manager.py :: _get_model_used_vram_gb
Priority: measured delta (ground truth, factors already baked in) > explicit
`used_vram_gb` (raw full-precision — quant/offload factors ARE applied on top) >
local file size > HF cache size. Quantization factor: ÷4 (4-bit), ÷2 (8-bit).
This is why a 4-bit Wan2.2 no longer reports a bogus ~151 GB.
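A rough sketch of the priority order and quantization factor. `estimate_vram_gb` is a hypothetical stand-in for `_get_model_used_vram_gb`; it assumes the ÷4 / ÷2 factor applies to every non-measured source and leaves out the offload factor, neither of which is confirmed by these notes.

```python
from typing import Optional


def estimate_vram_gb(
    *,
    measured_delta_gb: Optional[float] = None,
    used_vram_gb: Optional[float] = None,
    file_size_gb: Optional[float] = None,
    hf_cache_gb: Optional[float] = None,
    load_in_4bit: bool = False,
    load_in_8bit: bool = False,
) -> Optional[float]:
    """Pick the best available size source, then apply the quantization factor."""
    if measured_delta_gb is not None:
        return measured_delta_gb  # ground truth: factors are already baked in
    candidates = (used_vram_gb, file_size_gb, hf_cache_gb)
    raw = next((value for value in candidates if value is not None), None)
    if raw is None:
        return None
    if load_in_4bit:
        return raw / 4
    if load_in_8bit:
        return raw / 2
    return raw


if __name__ == "__main__":
    # A raw 151 GB full-precision estimate shrinks to 37.75 GB under 4-bit.
    print(estimate_vram_gb(used_vram_gb=151.0, load_in_4bit=True))
```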
### Universal eviction invariant
ANY model of ANY type, before loading when not already resident, must call
`manager.ensure_vram_for(model_key, resolved_name)` — which evicts other models
(one or more, LRU first) until there is enough free VRAM. This is wired into
EVERY load branch of `request_model` (per-model "on-request", per-model "load",
legacy "loadall", and the ondemand fallback). Do not add a new load path that
returns `already_loaded: False` without calling `ensure_vram_for` first — that
was the bug where a "load"-mode text model came back on CPU because the video
model that displaced it was never evicted.
### codai/models/manager.py :: _evict_models_for_vram(needed_gb)
Evicts LRU-first ONLY until `free_vram >= needed_gb`, so multiple small models
COEXIST in VRAM when they fit together. Eviction is synchronous (cleanup +
`cuda.empty_cache()` complete before returning) so load/unload happens one at a
time. Accelerate device_map models: call
`accelerate.hooks.remove_hook_from_submodules(model)` BEFORE moving to CPU,
otherwise the dispatch hooks keep CUDA tensors alive and VRAM is never freed.
NEVER evict a BUSY model (one with ref_count > 0 in its ModelInstancePool —
i.e. actively serving a request). Moving its weights off the GPU mid-forward-
pass crashes the in-flight request with a CUDA device-side assert and poisons
the context. `_evict_models_for_vram` checks `_is_key_busy(key)` and
`_wait_until_idle(key)` before evicting; busy non-active models are skipped, the
active model is waited on. `_evict_key` calls `pool.cleanup_all()` so EVERY
instance (not just the primary) is freed. Because eviction can BLOCK waiting for
a model to go idle, callers MUST invoke `request_model` off the event loop
(`await asyncio.to_thread(...)`) — text and video endpoints already do — or the
wait deadlocks the very request it is waiting on.
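The selection rule (LRU first, stop as soon as enough VRAM is free, skip busy models) can be illustrated with a pure function. `pick_evictions` is a hypothetical name; the sketch only chooses victims and does not model waiting on the active model, hook removal, or the actual cleanup.

```python
from collections import OrderedDict
from typing import Callable, List


def pick_evictions(
    resident: "OrderedDict[str, float]",
    free_gb: float,
    needed_gb: float,
    is_busy: Callable[[str], bool],
) -> List[str]:
    """Choose models to evict; `resident` maps key -> VRAM GB, least recently used first."""
    victims: List[str] = []
    for key, size_gb in resident.items():
        if free_gb >= needed_gb:
            break  # enough room: leave the remaining models resident
        if is_busy(key):
            continue  # never evict a model that is serving a request
        victims.append(key)
        free_gb += size_gb
    return victims


if __name__ == "__main__":
    loaded = OrderedDict([("tts", 2.0), ("embed", 1.0), ("video", 18.0)])
    print(pick_evictions(loaded, free_gb=3.0, needed_gb=5.0, is_busy=lambda key: False))
```

Here only `tts` is evicted, so `embed` and `video` keep coexisting in VRAM.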
================================================================================
## Flash-Attention-2 requires the whole model on GPU
================================================================================
FA2 kernels are CUDA-only and assume every layer is resident on one CUDA device.
If the model is split GPU+CPU (accelerate offload), FA2 triggers a CUDA
device-side assert that corrupts the process. Therefore `backends/cuda.py` only
enables FA2 when the model fits fully in free GPU VRAM, or the user forced
full-GPU residence (`no_ram=True` or `offload_strategy='none'`). Otherwise it
falls back to `attn_implementation='sdpa'` (which handles mixed devices and still
uses flash kernels for GPU-resident layers). The manager passes
`expected_vram_gb` (= `_get_model_used_vram_gb`) into `load_model` so the backend
can make this decision. The three UI flash checkboxes (`flash_attention`,
`sdcpp_flash_attn`, `sdcpp_diffusion_flash_attn`) are OR'd into FA2 intent for
transformers models, since the sdcpp flags are no-ops for HF models.
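A sketch of the decision under the rules above. `flash_intent` and `choose_attn_implementation` are hypothetical names, and returning `None` when no flash flag is set (meaning "leave the library default") is an assumption; the `"flash_attention_2"` and `"sdpa"` strings are the values transformers accepts for `attn_implementation`.

```python
from typing import Any, Mapping, Optional

FLASH_FLAGS = ("flash_attention", "sdcpp_flash_attn", "sdcpp_diffusion_flash_attn")


def flash_intent(cfg: Mapping[str, Any]) -> bool:
    """OR the three UI checkboxes into a single FA2 intent."""
    return any(bool(cfg.get(flag)) for flag in FLASH_FLAGS)


def choose_attn_implementation(
    cfg: Mapping[str, Any], expected_vram_gb: float, free_vram_gb: float
) -> Optional[str]:
    """Use FA2 only when the whole model will live on the GPU; otherwise SDPA."""
    if not flash_intent(cfg):
        return None  # assumption: no explicit choice, keep the library default
    forced_full_gpu = bool(cfg.get("no_ram")) or cfg.get("offload_strategy") == "none"
    fits_on_gpu = expected_vram_gb <= free_vram_gb
    return "flash_attention_2" if (forced_full_gpu or fits_on_gpu) else "sdpa"


if __name__ == "__main__":
    print(choose_attn_implementation({"flash_attention": True}, 40.0, 22.0))  # split model
```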
================================================================================
## CUDA device-side assert poisons the whole process: fail fast
================================================================================
A CUDA "device-side assert triggered" / "illegal memory access" / "CUDA error"
corrupts the CUDA context PROCESS-WIDE. Every subsequent GPU op then fails with
the same async assert echo. It is UNRECOVERABLE in-process; the server must be
restarted. `MultiModelManager._mark_cuda_poisoned_if_fatal(err)` sets
`cuda_context_poisoned`; `request_model`, the text retry loop, and the video
endpoint all check it and fail fast with HTTP 503 + "Restart coderai to recover"
instead of retrying dozens of times. Do NOT add retry loops that re-load onto a
poisoned context.
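The fail-fast pattern might look like the sketch below. `PoisonTracker` and its substring matching are assumptions made for illustration; the real `_mark_cuda_poisoned_if_fatal` may detect fatal errors differently, and the real endpoints return HTTP 503 rather than raising `RuntimeError`.

```python
FATAL_CUDA_MARKERS = (
    "device-side assert triggered",
    "illegal memory access",
    "cuda error",
)


class PoisonTracker:
    """Sticky flag: once a fatal CUDA error is seen, every later check fails fast."""

    def __init__(self) -> None:
        self.cuda_context_poisoned = False

    def mark_if_fatal(self, err: BaseException) -> bool:
        text = str(err).lower()
        if any(marker in text for marker in FATAL_CUDA_MARKERS):
            self.cuda_context_poisoned = True
        return self.cuda_context_poisoned

    def check(self) -> None:
        """Call before loading or generating; callers map this to a 503 response."""
        if self.cuda_context_poisoned:
            raise RuntimeError("CUDA context poisoned. Restart coderai to recover")


if __name__ == "__main__":
    tracker = PoisonTracker()
    tracker.mark_if_fatal(RuntimeError("CUDA error: device-side assert triggered"))
    print(tracker.cuda_context_poisoned)
```

The flag is never reset in-process, which mirrors the note that the context cannot be recovered without a restart.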
================================================================================
## VRAM is not freed after eviction without expandable_segments
================================================================================
`codai/__init__.py` sets `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`
BEFORE torch is imported (honouring any pre-set value). This is REQUIRED, not
optional. Symptom without it: after evicting a model, params move to CPU
(`memory_allocated` drops) but `torch.cuda.memory_reserved()` stays high and
`torch.cuda.empty_cache()` frees almost nothing, so VRAM stays ~full and the next
model can't load. Root cause: HF/accelerate keeps the tied embedding/lm_head
weight (a single ~2 GB live tensor) in `tied_params_map`; the default CUDA
allocator cannot return a segment that contains ANY live block, so that one
tensor pins the whole ~18 GB segment. expandable_segments lets the allocator
return the freed pages around it. Do not remove this env var; do not set CUDA
config after torch initializes (it is read once at first CUDA use).
`NvidiaBackend.cleanup` also: removes accelerate hooks, walks every submodule
moving raw `_parameters`/`_buffers` to CPU (model.to('cpu') is a silent no-op on
dispatched models), and breaks lingering list/dict references to this model's
GPU tensors (scoped by storage data_ptr so coexisting models are untouched).
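A minimal sketch of setting the allocator option before torch is imported while honouring a pre-set value. `configure_cuda_allocator` is a hypothetical name; the actual code in `codai/__init__.py` is not shown in these notes.

```python
import os
from typing import MutableMapping


def configure_cuda_allocator(env: MutableMapping[str, str] = os.environ) -> str:
    """Set expandable_segments unless the user already chose an allocator config."""
    return env.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")


# Must run before the first `import torch` anywhere in the process.
configure_cuda_allocator()
# import torch  # safe only after the call above
```

`setdefault` keeps whatever the operator exported, so a deliberately different allocator configuration is not overwritten.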
================================================================================
## Chat generation: turn-boundary stop + enable_thinking (model-agnostic)
================================================================================
This is NOT a Qwen-only server. Keep model-specific handling minimal and
detection-based. Two helpers in `backends/cuda.py` handle chat generation
generically:
- `_eos_token_ids()` returns eos_token_id PLUS any known turn-end token that
ACTUALLY EXISTS in this model's vocab (<|im_end|>, <|eot_id|>, <|end|>,
<end_of_turn>, …). Each is added only if `convert_tokens_to_ids` returns a
real (non-unk) id, so a Llama/Mistral/etc. model just doesn't get <|im_end|>.
Without this, models whose turn ends with <|im_end|> (not the eos
<|endoftext|>) never stop and hallucinate extra assistant/user turns.
- `_build_chat_prompt(messages, enable_thinking, add_generation_prompt)` uses
the MODEL'S OWN `tokenizer.apply_chat_template` when it has one (correct
special tokens + proper enable_thinking handling), falling back to the legacy
formatter otherwise. `enable_thinking` is passed to the template only if it
accepts the kwarg (TypeError → retry without), so non-thinking models are
unaffected. enable_thinking threads from the request:
text.py(reasoning_enabled) → ModelManager.generate_chat[_stream] →
NvidiaBackend. Default False (suppress reasoning); True keeps <think> blocks
for callers that ask. Do NOT hardcode model-family tokens in generation paths
— gate on vocab presence / template support.
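The vocab-presence gate for stop tokens can be sketched like this. `eos_token_ids` below is an illustrative reimplementation, not the backend's `_eos_token_ids()`, and `StubTokenizer` is a made-up stand-in that mimics only the `eos_token_id`, `unk_token_id`, and `convert_tokens_to_ids` parts of a Hugging Face tokenizer.

```python
from typing import List

TURN_END_CANDIDATES = ["<|im_end|>", "<|eot_id|>", "<|end|>", "<end_of_turn>"]


def eos_token_ids(tokenizer) -> List[int]:
    """Return eos_token_id plus every turn-end token that exists in the vocab."""
    ids: List[int] = []
    if tokenizer.eos_token_id is not None:
        ids.append(tokenizer.eos_token_id)
    unk_id = getattr(tokenizer, "unk_token_id", None)
    for token in TURN_END_CANDIDATES:
        token_id = tokenizer.convert_tokens_to_ids(token)
        if token_id is None or token_id == unk_id or token_id in ids:
            continue  # not in this model's vocab, or already listed
        ids.append(token_id)
    return ids


class StubTokenizer:
    """Tiny fake vocab: eos is <|endoftext|>, turns end with <|im_end|>."""

    eos_token_id = 0
    unk_token_id = 1
    vocab = {"<|endoftext|>": 0, "<unk>": 1, "<|im_end|>": 7}

    def convert_tokens_to_ids(self, token: str) -> int:
        return self.vocab.get(token, self.unk_token_id)


if __name__ == "__main__":
    print(eos_token_ids(StubTokenizer()))
```

A model whose vocab lacks `<|im_end|>` simply gets its plain eos id back, so no family-specific branch is needed.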
================================================================================
## Thermal protection (model-agnostic, config-driven)
================================================================================
A long sequence of heavy generations can drive CPU/GPU hot enough that the
machine's own protection powers it off. `codai/models/thermal.py` guards against
this: before serving a request against a loaded model it waits until temps are
safe.
- Single choke point: `MultiModelManager.request_model()` calls
`thermal.wait_until_safe()` right after the CUDA-poison check, so EVERY request
type (text/image/video/audio/tts/embedding/spatial) is covered once.
- Mid-generation checkpoints (`thermal.checkpoint(context, throttle_seconds)`)
pause long runs that overheat AFTER the pre-request check passed: diffusion
step callbacks (`_vid_step_cb` / image `_step_cb`) call it per denoise step;
HF text generation adds a `StoppingCriteria` (via `_make_thermal_criteria()`
in backends/cuda.py) that runs ON the generate thread; blocking the streamer
CONSUMER loop would NOT pause the GPU (generation runs in a separate thread).
GGUF/llama.cpp text (backends/vulkan.py) uses the same idea via a llama.cpp
`StoppingCriteriaList` (`_make_llama_thermal_criteria()`) passed to every
create_(chat_)completion call; llama.cpp evaluates it synchronously per token.
Throttled (2 s) for high-frequency token loops; unthrottled for per-step.
- Config lives in config.json `thermal` (ThermalConfig): cpu_enabled, gpu_enabled
(default True each), cpu_high/cpu_resume, gpu_high/gpu_resume (default 90/87),
poll_seconds (default 5). Editable live in the admin Settings page; saving
pushes values onto global_args so no restart is needed.
- Hysteresis: pause when temp >= *high*, resume only once temp <= *resume*
(resume < high). A sensor that can't be read is treated as safe (never blocks).
- Readers: GPU via nvidia-smi (then rocm-smi, then psutil amdgpu); CPU via psutil
(k10temp/coretemp), then /sys/class/thermal, then `sensors`. 2s reading cache.
- ASYNC REQUIREMENT: the wait is a blocking time.sleep, so request_model MUST
always be invoked via `asyncio.to_thread(...)` from async endpoints — never
call it directly in an async handler, or the cooldown stalls the event loop
and the whole server stops accepting requests. All api/*.py call sites already
do this; keep it that way for any new endpoint.
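The high/resume hysteresis can be sketched as a small state machine. `Hysteresis` is a hypothetical class, not the contents of `codai/models/thermal.py`; the 90/87 defaults are the ones quoted above.

```python
from typing import Optional


class Hysteresis:
    """Pause at temp >= high; resume only once temp <= resume (resume < high)."""

    def __init__(self, high: float = 90.0, resume: float = 87.0) -> None:
        if resume >= high:
            raise ValueError("resume must be lower than high")
        self.high = high
        self.resume = resume
        self.paused = False

    def update(self, temp: Optional[float]) -> bool:
        """Feed one reading; return True while generation should stay paused."""
        if temp is None:
            self.paused = False  # unreadable sensor is treated as safe
        elif temp >= self.high:
            self.paused = True
        elif temp <= self.resume:
            self.paused = False
        # Between resume and high the previous state is kept.
        return self.paused


if __name__ == "__main__":
    guard = Hysteresis()
    print([guard.update(t) for t in (85, 90, 89, 88, 87, 89)])
```

Readings of 88 or 89 keep a paused run paused but do not pause a running one, which prevents rapid flapping around a single threshold.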
================================================================================
## Invariants checklist when adding/altering a model loader
================================================================================
1. Read settings from the per-model config, never from CLI/global_args.
2. Use the canonical keys (`flash_attention`, `n_ctx`).
3. transformers/HF-pipeline model → use codai/models/hf_loading helpers.
diffusers model → use PipelineQuantizationConfig.
4. Honour quantization, offload_strategy, offload_dir, max_gpu_percent, no_ram,
manual_ram_gb, precision, flash_attention.
5. Do not `.to()` a quantized model/pipeline; place it via device_map.
6. Maximize GPU first, then CPU RAM, then disk; CPU-only is last resort.
7. Never break VRAM coexistence — evict only the minimum needed.
8. FA2 only when the model fits fully on GPU; else use SDPA (offload-safe).
9. On a CUDA device-side assert, fail fast (503) — never retry onto a poisoned
context; the server must be restarted.
10. Keep `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` (set in
codai/__init__.py before torch import) — without it, evicted-model VRAM is
never returned to the driver.
11. Eviction must free CPU RAM too, not just VRAM: cleanup calls
`_trim_cpu_ram()` (glibc malloc_trim) so the evicted model's host-side copy /
offloaded weights are returned to the OS and swap is reclaimed; otherwise
RSS creeps up across evict/load cycles.
12. CPU threads are capped to HALF the cores (when >= 8) in codai/__init__.py
(OMP/MKL/OpenBLAS env before torch import) so model loading / 4-bit dequant
never saturates the machine. Do NOT lower sys.setswitchinterval during long
loads; it caused GIL scheduler thrashing (load avg > 10).
13. request_model() must ALWAYS be called via asyncio.to_thread from async
endpoints, because it can block (thermal cooldown, waiting for a busy model). A
direct call stalls the event loop.
14. Thermal protection is config-driven and model-agnostic (config.json
`thermal`). Don't special-case it per model/backend; it only reads temps and
sleeps. Honour the enable flags and high/resume hysteresis.
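Invariant 13 in miniature: a blocking call wrapped in `asyncio.to_thread` (Python 3.9+) leaves the event loop free to run other tasks. The `request_model` below is a stand-in that just sleeps, and `handler` and `demo` are hypothetical names; none of this is the real manager code.

```python
import asyncio
import time


def request_model(model_key: str) -> str:
    """Stand-in for the blocking manager call (thermal wait, busy-model wait)."""
    time.sleep(0.2)
    return f"{model_key}:ready"


async def handler(model_key: str) -> str:
    # Run the blocking call on a worker thread so the event loop keeps serving.
    return await asyncio.to_thread(request_model, model_key)


async def demo():
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.02)
            ticks += 1

    beat = asyncio.create_task(heartbeat())
    result = await handler("text-model")
    beat.cancel()
    return result, ticks


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The heartbeat keeps ticking while `request_model` sleeps; calling `request_model` directly inside `handler` would leave `ticks` at zero for the whole wait.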
# CoderAI API Documentation
This document describes the full HTTP API exposed by CoderAI, including OpenAI-compatible endpoints, native multimodal endpoints, profile/LoRA APIs, pipelines, admin APIs, examples, and end-to-end workflows.
The API is implemented with FastAPI in `codai/api/app.py` and routers under `codai/api/`, with admin routes under `codai/admin/routes.py`.
## Base URL
Default local server:
```text
http://127.0.0.1:8776
```
Most client calls use the `/v1` prefix:
```text
http://127.0.0.1:8776/v1
```
## Authentication
CoderAI supports web sessions and API bearer tokens.
For `/v1/*` routes, send:
```http
Authorization: Bearer <api-token>
```
Token management is available in the admin UI and admin API:
- `GET /admin/tokens`
- `GET /admin/api/tokens`
- `POST /admin/api/tokens`
Notes:
- `/v1/images/progress` is explicitly exempt from bearer auth in middleware.
- If the admin/session manager is not initialized, API auth can be bypassed by the server.
- Admin HTML/API routes use signed session cookies; many admin API routes require an admin role.
- Some profile routes also enforce local API auth internally.
Example reusable shell variables:
```bash
export CODERAI_URL="http://127.0.0.1:8776"
export CODERAI_TOKEN="your-api-token"
```
Example authenticated request:
```bash
curl -s "$CODERAI_URL/v1/models" \
-H "Authorization: Bearer $CODERAI_TOKEN"
```
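The same authenticated call can be prepared from Python with the standard library. This is a sketch: `build_request` is a hypothetical helper, and it reuses the `CODERAI_URL` and `CODERAI_TOKEN` environment variables from the shell example.

```python
import os
import urllib.request
from typing import Optional


def build_request(
    path: str, base_url: Optional[str] = None, token: Optional[str] = None
) -> urllib.request.Request:
    """Build a request for `path` carrying the bearer token header."""
    base = (base_url or os.environ.get("CODERAI_URL", "http://127.0.0.1:8776")).rstrip("/")
    bearer = token or os.environ.get("CODERAI_TOKEN", "")
    request = urllib.request.Request(base + path)
    request.add_header("Authorization", f"Bearer {bearer}")
    return request


if __name__ == "__main__":
    req = build_request("/v1/models")
    print(req.full_url, req.get_header("Authorization") is not None)
```

Passing the returned object to `urllib.request.urlopen` and reading the body with `json.load` would perform the actual call against a running server.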
## Common Data Conventions
### Media Inputs
Media fields usually accept either:
- A URL: `http://...`, `https://...`, or a CoderAI file URL such as `/v1/files/output.png`
- Raw base64 without a data URL prefix
- Data URLs such as `data:image/png;base64,...`, `data:video/mp4;base64,...`, `data:audio/wav;base64,...`
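A client-side sketch of the three accepted forms. `to_data_url` and `classify_media_input` are hypothetical helpers; the server's own detection logic is not shown in this document and may differ.

```python
import base64
import binascii


def to_data_url(raw: bytes, mime: str) -> str:
    """Encode raw bytes as a data URL such as data:image/png;base64,..."""
    return f"data:{mime};base64,{base64.b64encode(raw).decode('ascii')}"


def classify_media_input(value: str) -> str:
    """Label a media field value as 'data_url', 'url', or 'base64'."""
    if value.startswith("data:"):
        return "data_url"
    if value.startswith(("http://", "https://", "/v1/files/")):
        return "url"
    try:
        base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        raise ValueError("not a URL, data URL, or base64 payload") from None
    return "base64"


if __name__ == "__main__":
    print(to_data_url(b"hi", "image/png"))
    print(classify_media_input("/v1/files/output.png"))
```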
### Media Outputs
Generation endpoints typically return:
```json
{
"created": 1781090000,
"data": [
{
"url": "/v1/files/generated.png"
}
]
}
```
If `response_format` requests base64, the first data item uses a media-specific key:
- Images: `b64_json`
- Video: `b64_mp4`
- Audio: `b64_wav` or `b64_mp3`
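A small sketch for reading either output form from a generation response. `extract_media` is a hypothetical helper; it assumes the base64 payload sits under one of the keys listed above in the first `data` item.

```python
import base64
from typing import Tuple, Union

MEDIA_B64_KEYS = ("b64_json", "b64_mp4", "b64_wav", "b64_mp3")


def extract_media(response: dict) -> Tuple[str, Union[bytes, str]]:
    """Return (key, decoded bytes) for base64 output, or ('url', url) otherwise."""
    item = response["data"][0]
    for key in MEDIA_B64_KEYS:
        if key in item:
            return key, base64.b64decode(item[key])
    return "url", item["url"]


if __name__ == "__main__":
    sample = {"created": 1781090000, "data": [{"url": "/v1/files/generated.png"}]}
    print(extract_media(sample))
```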
### Progress Polling
Long-running image, video, audio, and LoRA jobs expose polling endpoints. Typical progress response:
```json
{
"current": 12,
"total": 30,
"active": true,
"phase": "generating",
"model": "model-id",
"pct": 40.0,
"it_per_s": 1.3,
"elapsed": 8.9
}
```
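A polling loop over such responses might look like this. `percent` and `poll_until_done` are hypothetical helpers, and treating `active: false` as "finished" is an assumption about the progress contract; `fetch` is injected so the sketch runs without a server.

```python
import time
from typing import Callable, Dict


def percent(progress: Dict) -> float:
    """Recompute the completion percentage from current/total."""
    total = progress.get("total") or 0
    if total <= 0:
        return 0.0
    return round(100.0 * progress.get("current", 0) / total, 1)


def poll_until_done(
    fetch: Callable[[], Dict],
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: int = 600,
) -> Dict:
    """Call fetch() until the job reports active == False or max_polls is hit."""
    last: Dict = {}
    for _ in range(max_polls):
        last = fetch()
        if not last.get("active", False):
            break
        sleep(interval)
    return last


if __name__ == "__main__":
    states = iter([
        {"current": 12, "total": 30, "active": True},
        {"current": 30, "total": 30, "active": False},
    ])
    final = poll_until_done(lambda: next(states), interval=0.0)
    print(percent(final))
```

In real use, `fetch` would wrap a GET to the relevant progress endpoint and return the parsed JSON.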
### Extra Fields
Most request models allow extra JSON fields (`extra="allow"`). This makes the API tolerant of OpenAI-compatible or Studio-style client parameters even when a specific route ignores them.
## Core Endpoints
### List Models
`GET /v1/models`
Returns configured models and metadata.
Response shape:
```json
{
"object": "list",
"data": [
{
"id": "Qwen/Qwen3-8B",
"object": "model",
"created": 1781090000,
"owned_by": "huggingface",
"type": "text",
"capabilities": ["text_generation"],
"backend": "cuda",
"model_path": "Qwen/Qwen3-8B",
"alias": "qwen3"
}
]
}
```
Example:
```bash
curl -s "$CODERAI_URL/v1/models" \
-H "Authorization: Bearer $CODERAI_TOKEN" | jq
```
### Capabilities Document
`GET /coderai/capabilities`
Returns CoderAI broker/studio capability metadata and hardware summary. This endpoint is used by AISBF and discovery integrations.
Example:
```bash
curl -s "$CODERAI_URL/coderai/capabilities" | jq
```
### Serve Generated Files
`GET /v1/files/{filename}`
Returns a generated or uploaded file from the configured output directory. Path traversal is rejected.
Example:
```bash
curl -L "$CODERAI_URL/v1/files/generated.png" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-o generated.png
```
### File Archive
`GET /v1/archive`
Lists generated media in the output/archive directory.
```json
{
"files": [
{
"filename": "image_001.png",
"type": "image",
"size": 123456,
"created": 1781090000,
"url": "/v1/files/image_001.png"
}
]
}
```
`DELETE /v1/archive/{filename}` deletes an archived file.
```bash
curl -X DELETE "$CODERAI_URL/v1/archive/image_001.png" \
-H "Authorization: Bearer $CODERAI_TOKEN"
```
## Text Generation
CoderAI exposes OpenAI-compatible chat and legacy completion APIs.
### Chat Completions
`POST /v1/chat/completions`
Request fields:
| Field | Type | Default | Description |
|---|---:|---:|---|
| `model` | string | required | Model id from `/v1/models` |
| `messages` | array | required | Chat messages with `role` and `content` |
| `temperature` | number | `0.7` | Sampling temperature |
| `top_p` | number | `1.0` | Nucleus sampling |
| `n` | integer | `1` | Number of completions |
| `max_tokens` | integer/null | `null` | Max generated tokens |
| `stream` | boolean | `false` | Return SSE chunks |
| `stop` | string/array/null | `null` | Stop sequence(s) |
| `presence_penalty` | number | `0.0` | OpenAI-compatible field |
| `frequency_penalty` | number | `0.0` | OpenAI-compatible field |
| `repeat_penalty` | number | `1.0` | Repetition penalty |
| `tools` | array/null | `null` | Function/tool definitions |
| `tool_choice` | string/object/null | `auto` | Tool selection control |
| `enable_thinking` | boolean | `false` | Enables reasoning/thinking templates where supported |
| `response_format` | object/null | `null` | Accepted for compatibility |
Basic request:
```bash
curl -s "$CODERAI_URL/v1/chat/completions" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-8B",
"messages": [
{"role": "system", "content": "You are concise."},
{"role": "user", "content": "Explain VRAM offloading in one paragraph."}
],
"temperature": 0.4,
"max_tokens": 300
}' | jq
```
Response shape:
```json
{
"id": "chatcmpl-...",
"object": "chat.completion",
"created": 1781090000,
"model": "Qwen/Qwen3-8B",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "VRAM offloading..."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 42,
"completion_tokens": 80,
"total_tokens": 122
}
}
```
Streaming request:
```bash
curl -N "$CODERAI_URL/v1/chat/completions" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-8B",
"messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
"stream": true
}'
```
Streaming responses use server-sent event style lines:
```text
data: {"id":"chatcmpl-...","object":"chat.completion.chunk","choices":[...]}
data: [DONE]
```
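A sketch of consuming those lines on the client. `iter_sse_chunks` is a hypothetical helper; the `delta.content` path used in the demo follows the OpenAI chunk convention and is an assumption here, since the example above elides the `choices` contents.

```python
import json
from typing import Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the parsed JSON payload of each `data:` line until [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank keep-alive lines and other fields are skipped
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


if __name__ == "__main__":
    stream = [
        'data: {"id":"chatcmpl-1","object":"chat.completion.chunk",'
        '"choices":[{"delta":{"content":"Hi"}}]}',
        "",
        "data: [DONE]",
    ]
    for chunk in iter_sse_chunks(stream):
        print(chunk["choices"][0]["delta"]["content"])
```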
Tool calling example:
```bash
curl -s "$CODERAI_URL/v1/chat/completions" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-8B",
"messages": [{"role": "user", "content": "What is the weather in Rome?"}],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather for a city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}
}
],
"tool_choice": "auto"
}'
```
### Legacy Completions
`POST /v1/completions`
Request fields are similar to OpenAI legacy completions:
| Field | Type | Default |
|---|---:|---:|
| `model` | string | required |
| `prompt` | string or string[] | required |
| `temperature` | number | `0.7` |
| `top_p` | number | `1.0` |
| `n` | integer | `1` |
| `max_tokens` | integer/null | `null` |
| `stream` | boolean | `false` |
| `stop` | string/array/null | `null` |
| `repeat_penalty` | number | `1.0` |
Example:
```bash
curl -s "$CODERAI_URL/v1/completions" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-8B",
"prompt": "The fastest way to reduce inference memory is",
"max_tokens": 120
}' | jq
```
## Images
### Image Progress
`GET /v1/images/progress`
Returns the current image-generation progress. This route is exempt from bearer auth in middleware.
```bash
curl -s "$CODERAI_URL/v1/images/progress" | jq
```
### Generate Images
`POST /v1/images/generations`
Request fields:
| Field | Type | Default | Description |
|---|---:|---:|---|
| `model` | string | required | Image model id |
| `prompt` | string | required | Positive prompt |
| `n` | integer | `1` | Number of images |
| `size` | string | `1024x1024` | Output size |
| `steps` | integer/null | model default | Inference steps |
| `guidance_scale` | number/null | model default | CFG/guidance |
| `quality` | string | `standard` | Compatibility field |
| `style` | string/null | `null` | Compatibility/style field |
| `response_format` | string | `url` | `url` or `b64_json` |
| `seed` | integer/null | random | Deterministic seed |
| `negative_prompt` | string/null | `null` | Negative prompt |
| `disable_safety_checker` | boolean | `false` | Disable safety checker where supported |
| `vae_model` | string/null | `null` | Per-request VAE override |
| `loras` | array/null | `null` | LoRA adapters `{model, weight, name}` |
| `character_profiles` | string[]/null | `null` | Saved character profile names |
| `character_references` | string[]/null | `null` | Inline reference images |
| `character_strength` | number | `0.6` | IP-Adapter/reference strength |
| `environment_profiles` | string[]/null | `null` | Saved environment profile names |
Example:
```bash
curl -s "$CODERAI_URL/v1/images/generations" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "stabilityai/stable-diffusion-xl-base-1.0",
"prompt": "cinematic photo of a brass robot botanist in a glass greenhouse, morning mist",
"negative_prompt": "blurry, low quality, distorted hands",
"size": "1024x1024",
"steps": 30,
"guidance_scale": 7.0,
"seed": 12345,
"response_format": "url"
}' | jq
```
LoRA example:
```json
{
"model": "image-model",
"prompt": "portrait of <character-token> as a space pilot",
"loras": [
{"model": "/home/me/loras/space_uniform.safetensors", "weight": 0.8, "name": "uniform"}
]
}
```
Character/environment consistency example:
```json
{
"model": "image-model",
"prompt": "Alice explores the old library at sunset",
"character_profiles": ["Alice"],
"environment_profiles": ["OldLibrary"],
"character_strength": 0.75,
"size": "1024x1024"
}
```
### Edit Image
`POST /v1/images/edits`
Fields:
- `model` required
- `prompt` required
- `image` required, base64/URL source image
- `mask` optional
- `n`, `size`, `response_format`, `strength`, `steps`, `guidance_scale`, `seed`, `quality`
```bash
curl -s "$CODERAI_URL/v1/images/edits" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "image-edit-model",
"image": "data:image/png;base64,...",
"prompt": "turn the sky into dramatic storm clouds",
"strength": 0.55,
"response_format": "url"
}'
```
### Inpaint Image
`POST /v1/images/inpaint`
Like edits, but `mask` is required.
```json
{
"model": "inpaint-model",
"image": "data:image/png;base64,...",
"mask": "data:image/png;base64,...",
"prompt": "replace the masked area with a carved wooden door",
"strength": 0.99,
"steps": 30,
"response_format": "url"
}
```
### Upscale Image
`POST /v1/images/upscale`
```json
{
"model": "realesrgan-x4plus",
"image": "data:image/png;base64,...",
"scale": 4,
"response_format": "url"
}
```
### Depth Map
`POST /v1/images/depth`
```json
{
"model": "depth-anything",
"image": "data:image/png;base64,...",
"response_format": "url"
}
```
### Segment Image
`POST /v1/images/segment`
```json
{
"model": "sam-vit-h",
"image": "data:image/png;base64,...",
"points": [[420, 300]],
"boxes": [[100, 100, 600, 700]],
"response_format": "url"
}
```
### Deblur Image
`POST /v1/images/deblur`
```json
{
"image": "data:image/png;base64,...",
"strength": 0.5,
"response_format": "url"
}
```
### Unpixelate Image
`POST /v1/images/unpixelate`
```json
{
"model": "realesrgan-x4plus",
"image": "data:image/png;base64,...",
"scale": 4,
"response_format": "url"
}
```
### Outfit Change
`POST /v1/images/outfit`
Fields:
- `model` required
- `image` or `video` optional input
- `prompt` required outfit/clothing description
- `negative_prompt`, `mask`, `steps`, `guidance_scale`, `strength`, `seed`, `response_format`
```json
{
"model": "inpaint-model",
"image": "data:image/png;base64,...",
"prompt": "tailored navy velvet evening suit with silver embroidery",
"negative_prompt": "distorted body, extra limbs",
"steps": 30,
"guidance_scale": 7.5,
"strength": 0.92,
"response_format": "url"
}
```
### Face Swap
`POST /v1/images/faceswap`
```json
{
"source_face": "data:image/png;base64,...",
"target": "data:image/png;base64,...",
"target_type": "image",
"response_format": "url"
}
```
For video targets, use `target_type: "video"`.
## Video
### Video Progress
`GET /v1/video/progress`
```bash
curl -s "$CODERAI_URL/v1/video/progress" \
-H "Authorization: Bearer $CODERAI_TOKEN" | jq
```
### Generate Video
`POST /v1/video/generations`
Primary fields:
| Field | Type | Default | Description |
|---|---:|---:|---|
| `model` | string | required | Video model id |
| `prompt` | string | `""` | Text prompt |
| `negative_prompt` | string/null | `null` | Negative prompt |
| `width` | integer | `512` | Width |
| `height` | integer | `512` | Height |
| `num_frames` | integer/null | model default | Frame count |
| `fps` | integer/null | model default | Frames per second |
| `num_inference_steps` | integer/null | model default | Diffusion steps |
| `guidance_scale` | number/null | model default | CFG/guidance |
| `seed` | integer/null | random | Seed |
| `mode` | string | `t2v` | `t2v`, `i2v`, `v2v`, `ti2v`, `interp` |
| `image` / `init_image` | string/null | `null` | Initial/reference frame |
| `end_image` | string/null | `null` | End frame for interpolation |
| `video` | string/null | `null` | Input video for v2v/post-processing |
| `strength` | number/null | `null` | Denoising strength |
| `camera_motion` | string/null | `null` | `zoom-in`, `pan-left`, etc. |
| `character_profiles` | string[]/null | `null` | Saved character profiles |
| `loras` | array/null | `null` | Video LoRA adapters |
| `response_format` | string | `url` | `url` or `b64_mp4` |
Text-to-video example:
```bash
curl -s "$CODERAI_URL/v1/video/generations" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "video-model",
"mode": "t2v",
"prompt": "a slow dolly shot through a neon market in the rain",
"negative_prompt": "low quality, flicker",
"width": 768,
"height": 432,
"num_frames": 49,
"fps": 12,
"num_inference_steps": 30,
"guidance_scale": 6.0,
"seed": 9001,
"response_format": "url"
}' | jq
```
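The same request can be assembled programmatically. The sketch below validates `mode` and `response_format` against the values listed in the field table and drops unset optional fields. The helper name `build_video_request` is hypothetical, and the idea that an omitted field falls back to the model default is an assumption based on the table's "model default" entries.

```python
from typing import Any

# Allowed values copied from the field table above.
VIDEO_MODES = {"t2v", "i2v", "v2v", "ti2v", "interp"}
VIDEO_RESPONSE_FORMATS = {"url", "b64_mp4"}


def build_video_request(model: str, prompt: str = "", mode: str = "t2v",
                        response_format: str = "url", **optional: Any) -> dict:
    """Assemble a /v1/video/generations body, omitting unset optional fields."""
    if mode not in VIDEO_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    if response_format not in VIDEO_RESPONSE_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format!r}")
    body = {"model": model, "prompt": prompt, "mode": mode,
            "response_format": response_format}
    # Assumption: a field that is left out is treated like null by the server.
    body.update({k: v for k, v in optional.items() if v is not None})
    return body


request = build_video_request(
    "video-model",
    prompt="a slow dolly shot through a neon market in the rain",
    width=768, height=432, num_frames=49, fps=None,
)
```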
Image-to-video example:
```json
{
"model": "i2v-model",
"mode": "i2v",
"prompt": "gentle camera push-in, hair and fabric moving in the wind",
"init_image": "data:image/png;base64,...",
"num_frames": 32,
"fps": 8,
"camera_motion": "zoom-in",
"response_format": "url"
}
```
Video with generated audio, subtitles, dub, and post-processing:
```json
{
"model": "video-model",
"prompt": "a robot chef prepares pasta in a futuristic kitchen",
"mode": "t2v",
"num_frames": 49,
"fps": 12,
"add_audio": true,
"audio_type": "ambient",
"audio_prompt": "soft kitchen ambience, gentle synth pad",
"generate_subtitles": true,
"burn_subtitles": true,
"subtitle_style": "minimal",
"upscale_output": true,
"upscale_factor": 2,
"interpolate_output": true,
"fps_multiplier": 2,
"response_format": "url"
}
```
Multi-character dialog example:
```json
{
"model": "video-model",
"prompt": "two detectives talk in a dim archive room",
"character_profiles": ["DetectiveA", "DetectiveB"],
"dialogs": [
{"character": "DetectiveA", "voice": "narrator_a", "text": "The file was never missing.", "lip_sync": true},
{"character": "DetectiveB", "voice": "narrator_b", "text": "Then someone wanted us to think it was.", "lip_sync": true}
],
"burn_subtitles": true,
"response_format": "url"
}
```
### Upscale Video
`POST /v1/video/upscale`
```json
{
"model": "realesrgan-video",
"video": "data:video/mp4;base64,...",
"upscale_factor": 2,
"response_format": "url"
}
```
### Subtitle Video
`POST /v1/video/subtitle`
```json
{
"model": "whisper-large-v3",
"video": "data:video/mp4;base64,...",
"language": "en",
"translate": true,
"target_lang": "it",
"burn": false,
"style": "default",
"response_format": "srt"
}
```
`response_format` can be `srt`, `vtt`, `json`, or `burned_video`.
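When `response_format` is `srt`, a client may want the cues as structured data. The following is a small sketch of a parser for the standard SubRip layout; it assumes the server returns conventional SRT text, and `parse_srt` is a hypothetical helper name.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,500".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(text: str) -> list[dict]:
    """Split SRT text into cues with start/end times in seconds."""
    cues = []
    for block in re.split(r"\n\s*\n", text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                g = match.groups()
                cues.append({
                    "start": _seconds(*g[:4]),
                    "end": _seconds(*g[4:]),
                    "text": "\n".join(lines[i + 1:]),
                })
                break
    return cues


sample = """1
00:00:01,000 --> 00:00:04,500
The file was never missing.

2
00:00:05,000 --> 00:00:07,250
Then someone wanted us to think it was.
"""
cues = parse_srt(sample)
```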
### Interpolate Video or Frames
`POST /v1/video/interpolate`
```json
{
"model": "rife",
"video": "data:video/mp4;base64,...",
"fps_multiplier": 2,
"response_format": "url"
}
```
Frame interpolation:
```json
{
"model": "rife",
"init_image": "data:image/png;base64,...",
"end_image": "data:image/png;base64,...",
"fps_multiplier": 4,
"response_format": "url"
}
```
### Dub Video
`POST /v1/video/dub`
```json
{
"model": "whisper-large-v3",
"video": "data:video/mp4;base64,...",
"source_lang": "en",
"target_lang": "es",
"voice_clone": true,
"burn_subtitles": true,
"response_format": "url"
}
```
## Audio
### Transcriptions
`POST /v1/audio/transcriptions`
This is an OpenAI-style multipart form endpoint.
Form fields:
| Field | Type | Default | Description |
|---|---:|---:|---|
| `model` | string | required | Whisper/transcription model |
| `file` | file | required | Audio/video file upload |
| `language` | string/null | `null` | Language hint |
| `prompt` | string/null | `null` | Context prompt |
| `response_format` | string | `json` | `json`, `verbose_json`, `text`, `srt`, `vtt` |
| `temperature` | number | `0.0` | Decoding temperature |
Example:
```bash
curl -s "$CODERAI_URL/v1/audio/transcriptions" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-F model="whisper-large-v3" \
-F file=@speech.wav \
-F language="en" \
-F response_format="json" | jq
```
Text-only response:
```bash
curl -s "$CODERAI_URL/v1/audio/transcriptions" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-F model="whisper-large-v3" \
-F file=@speech.wav \
-F response_format="text"
```
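For clients without a multipart-capable HTTP library, the form body can be built by hand. This is a sketch using only the standard library; `encode_multipart` is a hypothetical helper, and the audio bytes are a placeholder.

```python
import uuid


def encode_multipart(fields: dict[str, str],
                     file_field: str, filename: str, file_bytes: bytes,
                     file_type: str = "application/octet-stream") -> tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data request."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        # One part per plain text form field.
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    # The file part carries a filename and its own content type.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{file_field}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {file_type}\r\n\r\n".encode("utf-8")
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


body, content_type = encode_multipart(
    {"model": "whisper-large-v3", "response_format": "text"},
    file_field="file", filename="speech.wav",
    file_bytes=b"RIFF placeholder", file_type="audio/wav",
)
```

The returned pair can then be passed to `urllib.request.Request(url, data=body, headers={"Content-Type": content_type, "Authorization": ...})`.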
### Text-to-Speech
`POST /v1/audio/speech`
Request fields:
- `model` required
- `input` required text
- `voice` default `af_sarah`
- `response_format` default `mp3`
- `speed` default `1.0`
- `voice_profile` optional saved profile name
Response:
```json
{
"audio": "<base64-audio>"
}
```
Example:
```bash
curl -s "$CODERAI_URL/v1/audio/speech" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "kokoro",
"input": "Local inference is online.",
"voice": "af_sarah",
"response_format": "mp3",
"speed": 1.0
}' | jq -r .audio | base64 -d > speech.mp3
```
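The Python equivalent of the `jq -r .audio | base64 -d` step is sketched below. `decode_speech` is a hypothetical helper name, and the response used here is a stand-in built locally rather than real server output.

```python
import base64
import json


def decode_speech(response_text: str) -> bytes:
    """Extract and decode the base64 `audio` field of a speech response."""
    payload = json.loads(response_text)
    return base64.b64decode(payload["audio"], validate=True)


# Stand-in response; a real one would carry encoded MP3 data.
fake_response = json.dumps({"audio": base64.b64encode(b"ID3 demo").decode("ascii")})
audio_bytes = decode_speech(fake_response)
```

Writing `audio_bytes` to a file opened in binary mode yields the same result as the shell pipeline above.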
### Audio Generation Progress
`GET /v1/audio/progress`
```bash
curl -s "$CODERAI_URL/v1/audio/progress" \
-H "Authorization: Bearer $CODERAI_TOKEN" | jq
```
### Generate Audio / Music / SFX
`POST /v1/audio/generate`
Request fields:
| Field | Type | Default |
|---|---:|---:|
| `model` | string | required |
| `prompt` | string | required |
| `duration` | number | `10.0` |
| `top_k` | integer | `250` |
| `top_p` | number | `0.0` |
| `temperature` | number | `1.0` |
| `cfg_coef` | number | `3.0` |
| `seed` | integer/null | `null` |
| `melody` | string/null | `null` |
| `voice_profile` | string/null | `null` |
| `response_format` | string | `url` |
Example:
```json
{
"model": "facebook/musicgen-medium",
"prompt": "warm lo-fi loop with brushed drums and soft Rhodes chords",
"duration": 12,
"temperature": 1.0,
"cfg_coef": 3.0,
"seed": 44,
"response_format": "url"
}
```
Melody-conditioned example:
```json
{
"model": "facebook/musicgen-melody",
"prompt": "cinematic orchestral arrangement of the melody",
"melody": "data:audio/wav;base64,...",
"duration": 20,
"response_format": "url"
}
```
### Voice Profiles
List voices:
`GET /v1/audio/voices`
Create voice profile:
`POST /v1/audio/voices`
Multipart fields:
- `name`
- `transcript`
- `description`
- `audio` file
```bash
curl -s "$CODERAI_URL/v1/audio/voices" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-F name="narrator_a" \
-F transcript="This is the exact reference transcript." \
-F description="Warm narrator voice" \
-F audio=@reference.wav | jq
```
Get, patch, delete:
- `GET /v1/audio/voices/{name}`
- `PATCH /v1/audio/voices/{name}`
- `DELETE /v1/audio/voices/{name}`
Extract a voice profile from audio or video:
`POST /v1/audio/voices/extract`
```json
{
"name": "speaker_from_clip",
"description": "Extracted from interview clip",
"video": "data:video/mp4;base64,...",
"transcript": "Optional exact transcript for the selected speech segment."
}
```
### Voice Clone
`POST /v1/audio/clone`
Fields:
- `text` required output text
- `voice_name` optional saved profile
- `ref_audio` and `ref_text` optional inline reference
- `speed`, `seed`, `response_format`
Using saved voice:
```json
{
"text": "The archive doors opened at midnight.",
"voice_name": "narrator_a",
"speed": 0.95,
"seed": 10,
"response_format": "url"
}
```
Using inline reference:
```json
{
"text": "The system is ready.",
"ref_audio": "data:audio/wav;base64,...",
"ref_text": "This is the reference speaker transcript.",
"response_format": "b64_wav"
}
```
### Voice Conversion
`POST /v1/audio/convert`
Fields:
- `source_audio` required
- `target_voice` or `voice_name` optional
- `f0_condition` singing-mode pitch conditioning
- `pitch_shift`
- `diffusion_steps`
- `length_adjust`
- `inference_cfg_rate`
- `response_format`
```json
{
"source_audio": "data:audio/wav;base64,...",
"voice_name": "singer_a",
"f0_condition": true,
"pitch_shift": 0,
"diffusion_steps": 20,
"response_format": "url"
}
```
### Audio Stems
`POST /v1/audio/stems`
```json
{
"audio": "data:audio/wav;base64,...",
"stem_mode": "vocals-instrumental",
"response_format": "url",
"fallback_mode": true
}
```
Supported split modes include:
- `vocals-instrumental`
- `4-stem`
- `drums-bass-other`
### Audio Cleanup
`POST /v1/audio/cleanup`
```json
{
"audio": "data:audio/wav;base64,...",
"noise_reduction": true,
"normalize": true,
"remove_hum": true,
"repair_clicks": false,
"response_format": "url",
"fallback_mode": true
}
```
## Embeddings
`POST /v1/embeddings`
Request fields:
| Field | Type | Default | Description |
|---|---:|---:|---|
| `model` | string | required | Embedding model |
| `input` | string/string[] | required | Text input(s) |
| `image` | string/string[]/null | `null` | Optional image input(s) for multimodal embeddings |
| `encoding_format` | string | `float` | `float` or `base64` |
| `dimensions` | integer/null | `null` | Optional truncation size |
Example:
```bash
curl -s "$CODERAI_URL/v1/embeddings" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "BAAI/bge-small-en-v1.5",
"input": ["first document", "second document"],
"encoding_format": "float"
}' | jq
```
Response shape:
```json
{
"object": "list",
"data": [
{"object": "embedding", "index": 0, "embedding": [0.01, -0.02]},
{"object": "embedding", "index": 1, "embedding": [0.03, 0.04]}
],
"model": "BAAI/bge-small-en-v1.5",
"usage": {"prompt_tokens": 4, "total_tokens": 4}
}
```
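A common next step is comparing the returned vectors. The sketch below orders embeddings by `index` and computes cosine similarity with the standard library; the helper names are hypothetical, and the response dictionary reuses the illustrative values shown above.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero-length vector")
    return dot / norm


def vectors_by_index(response: dict) -> list[list[float]]:
    """Return embeddings ordered by their `index` field."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


response = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [0.01, -0.02]},
        {"object": "embedding", "index": 1, "embedding": [0.03, 0.04]},
    ],
}
first, second = vectors_by_index(response)
score = cosine_similarity(first, second)
```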
Multimodal embedding example:
```json
{
"model": "clip-embedding-model",
"input": "a red sports car",
"image": "data:image/png;base64,...",
"encoding_format": "base64"
}
```
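The wire layout of `base64` embeddings is not specified here. The sketch below assumes the OpenAI convention of packed little-endian float32 values; that layout is an assumption and should be checked against an actual response. `decode_embedding` is a hypothetical helper name.

```python
import base64
import struct


def decode_embedding(encoded: str) -> list[float]:
    """Decode a base64 string of packed float32 values (assumed little-endian)."""
    raw = base64.b64decode(encoded)
    if len(raw) % 4:
        raise ValueError("payload is not a whole number of float32 values")
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


# Locally built sample; the values are exactly representable in float32.
sample = base64.b64encode(struct.pack("<3f", 1.0, -2.0, 0.5)).decode("ascii")
vector = decode_embedding(sample)
```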
## Character Profiles
Character profiles are named collections of reference images used for visual identity consistency in image/video generation.
### Create or Replace Character
`POST /v1/characters`
```json
{
"name": "Alice",
"description": "Short-haired detective in a charcoal coat",
"images": [
{"label": "front", "data": "data:image/png;base64,..."},
{"label": "side", "data": "data:image/png;base64,..."}
]
}
```
Response:
```json
{"ok": true, "name": "Alice", "image_count": 2}
```
### List Characters
`GET /v1/characters`
```json
{
"characters": [
{"name": "Alice", "description": "...", "image_count": 2, "created_at": 1781090000}
]
}
```
### Get Character
`GET /v1/characters/{name}`
Returns profile metadata plus base64 images.
### Patch Character
`PATCH /v1/characters/{name}`
```json
{
"description": "Updated description",
"add_images": [{"label": "close-up", "data": "data:image/png;base64,..."}],
"remove_indices": [0]
}
```
### Delete Character
`DELETE /v1/characters/{name}`
### Generate Character References
`POST /v1/characters/generate`
Generates reference images from text and saves them as a profile.
```json
{
"name": "CaptainNova",
"description": "A calm starship captain",
"prompt": "consistent character sheet, woman starship captain, front and side views, clean studio lighting",
"model": "image-model",
"n": 4,
"steps": 30,
"width": 768,
"height": 768
}
```
### Extract Character from Media
`POST /v1/characters/extract`
```json
{
"name": "InterviewGuest",
"description": "Face crops extracted from source video",
"videos": ["data:video/mp4;base64,..."],
"max_images": 5
}
```
## Environment Profiles
Environment profiles are named collections of reference images used to condition scene/background style.
Routes mirror character profiles:
- `POST /v1/environments`
- `GET /v1/environments`
- `GET /v1/environments/{name}`
- `PATCH /v1/environments/{name}`
- `DELETE /v1/environments/{name}`
- `POST /v1/environments/generate`
- `POST /v1/environments/extract`
Create example:
```json
{
"name": "OldLibrary",
"description": "Warm wood, tall shelves, dust in sunset beams",
"images": [
{"label": "wide", "data": "data:image/png;base64,..."}
]
}
```
Generate example:
```json
{
"name": "MarsHangar",
"description": "Industrial red planet aircraft hangar",
"prompt": "wide cinematic environment concept art of a Mars aircraft hangar, dust, red light, realistic",
"model": "image-model",
"n": 4,
"width": 1024,
"height": 768
}
```
Use in generation:
```json
{
"model": "image-model",
"prompt": "Alice stands beside a parked rover",
"character_profiles": ["Alice"],
"environment_profiles": ["MarsHangar"]
}
```
## LoRA Training and Registry
### Train LoRA
`POST /v1/loras/train`
Request fields:
| Field | Type | Default | Description |
|---|---:|---:|---|
| `name` | string | required | LoRA name |
| `base_model` | string | required | Base model to train against |
| `train_base_model` | string/null | `null` | Optional training model override |
| `target` | string | `image` | `image` or `video` |
| `quantize_4bit` | boolean | `true` | Quantized training where supported |
| `num_frames` | integer | `1` | Video/frame setting |
| `character` | string/null | `null` | Use saved character profile |
| `environment` | string/null | `null` | Use saved environment profile |
| `images` | string[]/null | `null` | Inline training images |
| `instance_prompt` | string/null | `null` | Instance prompt/token |
| `steps` | integer | `800` | Training steps |
| `rank` | integer | `16` | LoRA rank |
| `learning_rate` | number | `0.0001` | LR |
| `resolution` | integer | `512` | Training resolution |
| `seed` | integer | `42` | Seed |
Example:
```json
{
"name": "alice_identity",
"base_model": "image-model",
"target": "image",
"character": "Alice",
"instance_prompt": "photo of alice_person",
"steps": 800,
"rank": 16,
"learning_rate": 0.0001,
"resolution": 768
}
```
Training requests are blocking and are queued one at a time.
### LoRA Progress
`GET /v1/loras/progress`
```bash
curl -s "$CODERAI_URL/v1/loras/progress" \
-H "Authorization: Bearer $CODERAI_TOKEN" | jq
```
### LoRA Registry
- `GET /v1/loras`
- `GET /v1/loras/{name}`
- `DELETE /v1/loras/{name}`
Use a trained LoRA in image/video requests:
```json
{
"model": "image-model",
"prompt": "alice_person in a cyberpunk alley",
"loras": [{"model": "alice_identity", "weight": 0.85}]
}
```
## 2D / 3D / Spatial APIs
### Image to 3D
`POST /v1/images/to3d`
```json
{
"image": "data:image/png;base64,...",
"method": "mesh",
"max_shift": 20,
"response_format": "url"
}
```
`method` can include `stereo`, `anaglyph`, `depth`, or `mesh`.
### 3D to Image
`POST /v1/images/from3d`
```json
{
"model_data": "data:model/gltf-binary;base64,...",
"format": "glb",
"camera_distance": 2.0,
"camera_elevation": 30,
"camera_azimuth": 45,
"width": 768,
"height": 768,
"response_format": "url"
}
```
### Video to 3D
`POST /v1/video/to3d`
```json
{
"video": "data:video/mp4;base64,...",
"method": "anaglyph",
"max_shift": 15,
"response_format": "url"
}
```
### 3D to Video
`POST /v1/video/from3d`
```json
{
"model_data": "data:model/gltf-binary;base64,...",
"format": "glb",
"frames": 36,
"fps": 12,
"camera_elevation": 20,
"camera_distance": 2.5,
"width": 768,
"height": 768,
"response_format": "url"
}
```
### Generate 3D Model
`POST /v1/3d/generate`
```json
{
"prompt": "a stylized low-poly red dragon statue",
"model": "3d-model",
"steps": 64,
"seed": 42,
"response_format": "url"
}
```
Image-conditioned 3D generation:
```json
{
"image": "data:image/png;base64,...",
"model": "triposr",
"response_format": "url"
}
```
## Built-In Pipelines
Pipelines chain existing endpoints server-side and return the aggregated results in `steps` and `data`.
Implementation caveat: `codai/api/pipelines.py` currently imports video helpers named `create_video_generation` and `create_video_dub`, while `codai/api/video.py` defines the route handlers as `video_generations` and `video_dub`. If those aliases are not added elsewhere at runtime, built-in video pipeline calls can fail even though the routes are registered. The lower-level video endpoints documented above are the canonical API surface.
### Image to Video Pipeline
`POST /v1/pipelines/image-to-video`
Steps:
1. Generate image with `image_model`
2. Animate it with `video_model`
3. Optionally add audio and upscale
```json
{
"prompt": "a lonely lighthouse under aurora lights, cinematic",
"image_model": "image-model",
"video_model": "video-model",
"image_size": "1024x1024",
"image_steps": 30,
"image_cfg": 7.0,
"image_seed": 100,
"num_frames": 32,
"fps": 8,
"num_inference_steps": 25,
"guidance_scale": 6.5,
"camera_motion": "zoom-in",
"add_audio": true,
"audio_type": "ambient",
"audio_prompt": "distant waves, soft wind",
"upscale_output": true,
"response_format": "url"
}
```
### Video Dub Pipeline
`POST /v1/pipelines/video-dub`
```json
{
"model": "whisper-large-v3",
"video": "data:video/mp4;base64,...",
"source_lang": "en",
"target_lang": "de",
"voice_clone": true,
"burn_subtitles": true,
"response_format": "url"
}
```
### Story Pipeline
`POST /v1/pipelines/story`
Steps:
1. LLM writes visual scene descriptions
2. Image model generates scene images
3. Video model animates the first scene
4. Optional TTS narration
```json
{
"story": "A courier robot crosses a flooded city to deliver a seed vault key.",
"text_model": "Qwen/Qwen3-8B",
"image_model": "image-model",
"video_model": "video-model",
"tts_model": "kokoro",
"tts_voice": "af_sarah",
"num_scenes": 4,
"num_frames": 32,
"fps": 8,
"response_format": "url"
}
```
### Audio Dub Pipeline
`POST /v1/pipelines/audio-dub`
Steps:
1. Transcribe source audio/video
2. Optionally translate transcript
3. Synthesize dubbed audio with voice cloning
4. If input is video, replace audio track
```json
{
"video": "data:video/mp4;base64,...",
"voice_name": "narrator_a",
"source_lang": "en",
"target_lang": "fr",
"whisper_model": "whisper-large-v3",
"speed": 1.0,
"burn_subtitles": true,
"response_format": "url"
}
```
## Custom Pipelines
Custom pipelines let clients define reusable multi-step workflows with template variables.
Implementation caveat: custom pipeline execution calls each handler with `(request, http_request)`. Some handlers in `codai/api/` accept only the request object, so step types whose handlers do not accept an HTTP request may need handler signature adjustments before they run reliably. Treat `/v1/pipelines/step-types` as the server's advertised builder schema and validate complex custom pipelines in your deployment.
### List Custom Pipelines
`GET /v1/pipelines/custom`
### List Step Types
`GET /v1/pipelines/step-types`
Supported step types include:
- `text_gen`
- `image_gen`
- `image_edit`
- `image_inpaint`
- `image_upscale`
- `image_deblur`
- `image_unpix`
- `image_outfit`
- `image_faceswap`
- `video_gen`
- `video_upscale`
- `video_sub`
- `video_interp`
- `video_dub`
- `tts`
- `stt`
- `audio_gen`
- `voice_clone`
- `voice_convert`
Template variables:
- `{{input}}` - pipeline runtime input
- `{{stepN.output}}` - extracted text/base output from step N
- `{{stepN.url}}` - first URL output from step N
- `{{stepN.<field>}}` - any extracted field from step N
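To illustrate how these placeholders resolve, here is a client-side sketch of the substitution. It is an illustration only, not the server's implementation; `render`, the step result dictionaries, and the `/outputs/poster.png` URL are all hypothetical.

```python
import re

# Matches {{input}} or {{stepN.field}}, allowing optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(input|step(\d+)\.(\w+))\s*\}\}")


def render(template: str, pipeline_input: str, step_results: list[dict]) -> str:
    """Substitute {{input}} and {{stepN.field}} placeholders in a string."""
    def replace(match: re.Match) -> str:
        if match.group(1) == "input":
            return pipeline_input
        index, field = int(match.group(2)), match.group(3)
        if index >= len(step_results) or field not in step_results[index]:
            # Leave unresolved placeholders visible instead of guessing.
            return match.group(0)
        return str(step_results[index][field])
    return PLACEHOLDER.sub(replace, template)


results = [
    {"output": "a brass airship over a canyon"},
    {"url": "/outputs/poster.png"},
]
prompt = render("{{step0.output}}, slow cinematic movement", "airship", results)
image = render("{{step1.url}}", "airship", results)
```

A dry run like this can help catch references to steps that have not executed yet before a pipeline is saved.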
### Create Custom Pipeline
`POST /v1/pipelines/custom`
```json
{
"id": "poster-to-trailer",
"name": "Poster to Trailer",
"description": "Generate a poster concept, animate it, then create music.",
"steps": [
{
"type": "text_gen",
"label": "Write visual prompt",
"params": {
"model": "Qwen/Qwen3-8B",
"system": "Write vivid visual prompts only.",
"prompt": "Turn this idea into a cinematic image prompt: {{input}}"
}
},
{
"type": "image_gen",
"label": "Generate poster",
"params": {
"model": "image-model",
"prompt": "{{step0.output}}",
"size": "1024x1024"
}
},
{
"type": "video_gen",
"label": "Animate poster",
"params": {
"model": "video-model",
"mode": "i2v",
"prompt": "{{step0.output}}, slow cinematic movement",
"init_image": "{{step1.url}}",
"num_frames": 32,
"fps": 8
}
},
{
"type": "audio_gen",
"label": "Create soundtrack",
"params": {
"model": "musicgen",
"prompt": "epic short trailer music for: {{input}}",
"duration": 12
},
"continue_on_error": true
}
]
}
```
### Update and Delete
- `PUT /v1/pipelines/custom/{pipeline_id}`
- `DELETE /v1/pipelines/custom/{pipeline_id}`
### Run Saved Pipeline
`POST /v1/pipelines/custom/{pipeline_id}/run`
```json
{
"input": "a solar-powered train crossing the Sahara at night"
}
```
### Run Inline Pipeline
`POST /v1/pipelines/run`
Sends a `PipelineDefinition` directly without saving it. The current implementation executes with an empty `{{input}}`, so either use static params or save the pipeline and run it through `/v1/pipelines/custom/{pipeline_id}/run` when runtime input is required.
### Audio Understanding Pipeline
`POST /v1/pipelines/audio-understand`
Transcribes audio, then optionally asks a text model to summarize or reason over it.
```json
{
"audio": "data:audio/wav;base64,...",
"audio_model": "whisper-large-v3",
"text_model": "Qwen/Qwen3-8B",
"input": "Summarize action items and decisions.",
"language": "en"
}
```
### Audio Music Dub Pipeline
`POST /v1/pipelines/audio-music-dub`
The current implementation returns a structured workflow with placeholder stages for stems, translation/adaptation, voice conversion, and remix.
```json
{
"audio": "data:audio/wav;base64,...",
"audio_model": "whisper-large-v3",
"target_lang": "it",
"source_lang": "en",
"notes": "Preserve rhyme and chorus structure."
}
```
## Admin HTML Routes
Admin pages are session-cookie based.
| Method | Path | Purpose | Auth |
|---|---|---|---|
| `GET` | `/login` | Login page | Public |
| `POST` | `/login` | Login form | Public |
| `GET` | `/logout` | Logout | Optional session |
| `GET` | `/admin/change-password` | Password change page | Logged-in |
| `POST` | `/admin/change-password` | Change password | Logged-in |
| `GET` | `/admin` | Dashboard | Logged-in |
| `GET` | `/admin/models` | Model management page | Admin |
| `GET` | `/admin/tokens` | Token page | Admin |
| `GET` | `/admin/users` | User page | Admin |
| `GET` | `/chat` | Chat UI | Logged-in |
| `GET` | `/admin/settings` | Settings page | Admin |
| `GET` | `/admin/archive` | Archive page | Admin |
Static assets are mounted under `/static/admin/*`.
## Admin API
Admin APIs require a valid session cookie and the admin role unless noted.
### Status, Users, Tokens
| Method | Path | Body/Query | Purpose |
|---|---|---|---|
| `GET` | `/admin/api/status` | none | System, model, VRAM, queue, recent activity status |
| `POST` | `/admin/api/users` | `{username,password,role}` | Create user |
| `DELETE` | `/admin/api/users/{user_id}` | path | Delete user |
| `GET` | `/admin/api/tokens` | none | List API tokens |
| `POST` | `/admin/api/tokens` | `{name, provider?}` | Create token |
| `DELETE` | `/admin/api/tokens/{token_id}` | path | Delete token |
| `POST` | `/admin/api/system/reload` | none | Reload config/system state |
Create token example after logging in with a session cookie:
```bash
curl -s "$CODERAI_URL/admin/api/tokens" \
-b cookies.txt \
-H "Content-Type: application/json" \
-d '{"name":"automation","provider":"local"}' | jq
```
### Model and Cache Management
| Method | Path | Body/Query | Purpose |
|---|---|---|---|
| `GET` | `/admin/api/models` | none | List configured models |
| `POST` | `/admin/api/model-download` | `{model_id,file_pattern?}` | Start Hugging Face download |
| `GET` | `/admin/api/download-stream/{session_id}` | path | SSE download progress |
| `GET` | `/admin/api/downloads` | none | Active/recent downloads |
| `POST` | `/admin/api/download-cancel/{session_id}` | path | Cancel download |
| `POST` | `/admin/api/model-upload` | multipart chunk | Chunked model upload |
| `DELETE` | `/admin/api/models/{model_identifier}` | path | Remove cached model |
| `GET` | `/admin/api/hf-files` | `repo_id` | List HF repo files |
| `GET` | `/admin/api/cached-models` | none | Local cache inventory |
| `GET` | `/admin/api/cache-stats` | none | Disk/cache stats |
| `DELETE` | `/admin/api/cache` | `cache_type=all\|hf\|gguf` | Clear cache |
| `DELETE` | `/admin/api/cached-models/{model_id:path}` | `cache_type` | Delete cached model |
| `POST` | `/admin/api/model-enable` | `{path\|model_id,model_type}` | Enable model in config |
| `POST` | `/admin/api/model-disable` | `{path\|model_id,config_id?}` | Disable model |
| `GET` | `/admin/api/model-loaded-status` | none | Loaded model / pool info |
| `POST` | `/admin/api/model-load` | `{path}` | Load model now |
| `POST` | `/admin/api/model-unload` | `{path}` | Unload model |
| `POST` | `/admin/api/model-configure` | model config JSON | Configure model |
Download with SSE progress:
```bash
SESSION_ID=$(curl -s "$CODERAI_URL/admin/api/model-download" \
-b cookies.txt \
-H "Content-Type: application/json" \
-d '{"model_id":"Qwen/Qwen3-8B"}' | jq -r .session_id)
curl -N "$CODERAI_URL/admin/api/download-stream/$SESSION_ID" -b cookies.txt
```
SSE events include `progress`, `done`, `error`, and `keepalive`.
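A client reading the stream needs to split it into events. The sketch below follows the generic Server-Sent Events line format. It assumes the event name arrives in the `event:` field and the payload in `data:`; the exact payloads CoderAI sends are not shown here, so the sample lines are hypothetical.

```python
from typing import Iterable, Iterator


def parse_sse(lines: Iterable[str]) -> Iterator[dict]:
    """Yield {'event': ..., 'data': ...} dicts from Server-Sent Events lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line terminates the current event.
            if data:
                yield {"event": event, "data": "\n".join(data)}
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keepalive
        else:
            field, _, value = line.partition(":")
            value = value[1:] if value.startswith(" ") else value
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


# Hypothetical stream contents for illustration.
stream = [
    "event: progress\n",
    'data: {"percent": 40}\n',
    "\n",
    ": keepalive\n",
    "event: done\n",
    "data: {}\n",
    "\n",
]
events = list(parse_sse(stream))
```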
### Settings and Archive Admin
| Method | Path | Body/Query | Purpose |
|---|---|---|---|
| `GET` | `/admin/api/settings` | none | Current config sections |
| `POST` | `/admin/api/settings` | partial settings JSON | Save settings |
| `GET` | `/admin/api/archive` | `limit`, `offset` | List archive entries |
| `GET` | `/admin/api/archive/{gen_id}` | path | Archive entry detail |
| `DELETE` | `/admin/api/archive/{gen_id}` | path | Delete archive entry |
| `GET` | `/admin/api/archive/{gen_id}/files/{filename}` | path | Download archive file |
| `GET` | `/admin/api/archive-settings` | none | Archive config and retention options |
Settings include server/backend/model/offload/vulkan/archive/thermal/broker/parser/system-prompt sections.
### Hugging Face Search and Metadata
| Method | Path | Query | Purpose |
|---|---|---|---|
| `GET` | `/admin/api/hf-search` | `q`, `gguf_mode`, `pipeline_tag`, `sort`, `sizes`, `arch`, `capabilities`, `component_type` | Search models |
| `GET` | `/admin/api/hf-model-files` | `model_id` | List GGUF/model files with size/quant metadata |
| `GET` | `/admin/api/hf-model-info` | `model_id` | Full HF model metadata summary |
Example:
```bash
curl -s "$CODERAI_URL/admin/api/hf-search?q=whisper&capabilities=speech_to_text" \
-b cookies.txt | jq
```
### Admin Profile Proxies
Logged-in users can access profile metadata through admin routes:
| Method | Path | Purpose |
|---|---|---|
| `GET` | `/admin/api/characters` | List characters |
| `GET` | `/admin/api/characters/{name}` | Character detail |
| `GET` | `/admin/api/characters/{name}/thumbnail` | Character thumbnail |
| `DELETE` | `/admin/api/characters/{name}` | Delete character |
| `GET` | `/admin/api/environments` | List environments |
| `GET` | `/admin/api/environments/{name}` | Environment detail |
| `GET` | `/admin/api/environments/{name}/thumbnail` | Environment thumbnail |
| `DELETE` | `/admin/api/environments/{name}` | Delete environment |
| `GET` | `/admin/api/voices` | List voice profiles |
| `GET` | `/admin/api/voices/{name}` | Voice detail |
| `DELETE` | `/admin/api/voices/{name}` | Delete voice |
## AISBF / Broker Integration
CoderAI exposes:
- `GET /coderai/capabilities`
- OpenAI-compatible `/v1/models` and `/v1/chat/completions`
- Native `/v1/*` endpoints that can be proxied by AISBF
AISBF broker mode uses outbound WebSocket connections from CoderAI to AISBF for NAT traversal. The canonical broker protocol is documented in `coderai-broker-implementation-reference.md`.
Global-scope broker URL template:
```text
wss://<aisbf-host>/api/coderai/wss?provider_id=<provider_id>&client_id=<client_id>&username=global&registration_token=<token>
```
User-scope broker URL template:
```text
wss://<aisbf-host>/api/u/<username>/coderai/wss?provider_id=<provider_id>&client_id=<client_id>&username=<username>&registration_token=<token>
```
Important broker fields:
- `provider_id` identifies the AISBF provider configuration.
- `client_id` must be stable and match the provider config.
- `username` is `global` or the AISBF username for user-scoped providers.
- `registration_token` is provider-scoped and required for admission.
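The two URL templates above can be produced by one small function. This is a sketch; `broker_url` is a hypothetical helper, and the host and identifiers in the usage lines are placeholders.

```python
from urllib.parse import quote, urlencode


def broker_url(host: str, provider_id: str, client_id: str,
               registration_token: str, username: str = "global") -> str:
    """Build the broker WebSocket URL for global or user scope."""
    if username == "global":
        path = "/api/coderai/wss"
    else:
        # User-scoped providers embed the username in the path as well.
        path = f"/api/u/{quote(username, safe='')}/coderai/wss"
    query = urlencode({
        "provider_id": provider_id,
        "client_id": client_id,
        "username": username,
        "registration_token": registration_token,
    })
    return f"wss://{host}{path}?{query}"


global_url = broker_url("aisbf.example.com", "p1", "c1", "tok")
user_url = broker_url("aisbf.example.com", "p1", "c1", "tok", username="alice")
```

Using `urlencode` keeps tokens with reserved characters from corrupting the query string.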
AISBF can call operations such as `models.list`, `chat.completions`, `capabilities`, `register`, and `proxy`. Proxy operations can forward headers, query params, multipart form payloads, binary/base64 bodies, progress polling endpoints, and streaming envelopes.
## Error Handling
Common HTTP status codes:
| Status | Meaning |
|---:|---|
| `400` | Invalid request, missing required media, or incompatible fields |
| `401` | Missing/invalid token or session |
| `403` | Forbidden, unsafe file path, or insufficient role |
| `404` | Model, profile, file, pipeline, or archive entry not found |
| `422` | Validation error for strict fields |
| `429` | Rate limit or queue saturation |
| `500` | Generation/backend failure |
| `501` | Optional backend not installed |
| `503` | Model/backend unavailable or CUDA context poisoned |
Typical auth error:
```json
{
"detail": {
"message": "Invalid API key. Provide a valid Bearer token.",
"type": "invalid_request_error",
"code": "invalid_api_key"
}
}
```
If a CUDA device-side assert or illegal memory access poisons the context, CoderAI fails fast with a `503` whose message states that the process must be restarted.
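Client code can turn these responses into readable messages. The sketch below maps status codes to the meanings in the table and extracts `detail.message` when present. `describe_error` is a hypothetical helper, and handling a plain-string `detail` is an assumption about other error shapes the server might return.

```python
import json

# Short meanings condensed from the status table above.
STATUS_MEANINGS = {
    400: "invalid request", 401: "missing or invalid token or session",
    403: "forbidden", 404: "not found", 422: "validation error",
    429: "rate limit or queue saturation", 500: "generation or backend failure",
    501: "optional backend not installed", 503: "model or backend unavailable",
}


def describe_error(status: int, body: str) -> str:
    """Combine the status meaning with the server message, when present."""
    meaning = STATUS_MEANINGS.get(status, "unexpected status")
    try:
        detail = json.loads(body).get("detail")
    except (ValueError, AttributeError):
        detail = None  # body was not a JSON object
    if isinstance(detail, dict):
        message = detail.get("message")
    else:
        message = detail if isinstance(detail, str) else None
    return f"{status} ({meaning}): {message}" if message else f"{status} ({meaning})"


auth_body = json.dumps({"detail": {
    "message": "Invalid API key. Provide a valid Bearer token.",
    "type": "invalid_request_error",
    "code": "invalid_api_key",
}})
summary = describe_error(401, auth_body)
```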
## Complex Workflows
### Workflow 1: Consistent Character Image and Video
Goal: create a character, generate a scene image using that identity, then animate it.
1. Create character profile:
```bash
curl -s "$CODERAI_URL/v1/characters" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name":"Alice",
"description":"Detective with short black hair and charcoal coat",
"images":[{"label":"front","data":"data:image/png;base64,..."}]
}'
```
2. Generate an image with the profile:
```bash
IMAGE_URL=$(curl -s "$CODERAI_URL/v1/images/generations" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model":"image-model",
"prompt":"Alice in a rainy neon alley, cinematic detective noir",
"character_profiles":["Alice"],
"character_strength":0.75,
"size":"1024x1024",
"response_format":"url"
}' | jq -r '.data[0].url')
```
3. Animate the image:
```bash
curl -s "$CODERAI_URL/v1/video/generations" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-H "Content-Type: application/json" \
-d "{
\"model\":\"video-model\",
\"mode\":\"i2v\",
\"prompt\":\"Alice looks up as rain falls, subtle camera push-in\",
\"init_image\":\"$IMAGE_URL\",
\"num_frames\":32,
\"fps\":8,
\"camera_motion\":\"zoom-in\",
\"response_format\":\"url\"
}" | jq
```
### Workflow 2: Full Story Generation
Use the built-in story pipeline to generate a script, scene images, a short video, and narration.
```bash
curl -s "$CODERAI_URL/v1/pipelines/story" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"story":"A botanist finds a singing plant inside a crashed satellite.",
"text_model":"Qwen/Qwen3-8B",
"image_model":"image-model",
"video_model":"video-model",
"tts_model":"kokoro",
"tts_voice":"af_sarah",
"num_scenes":4,
"num_frames":32,
"fps":8,
"response_format":"url"
}' | jq
```
Output includes:
- `steps[0].text` generated scene script
- `steps[1].urls` generated images
- `data[0].video_url`
- `data[0].audio_url`
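Those fields can be collected defensively, since optional stages such as narration may be absent. This is a sketch; `summarize_story` is a hypothetical helper, and the sample result contains made-up values shaped like the fields listed above.

```python
def summarize_story(result: dict) -> dict:
    """Pull the script, image URLs, and media URLs out of a story response."""
    steps = result.get("steps", [])
    data = result.get("data", [])
    first = data[0] if data else {}
    return {
        "script": steps[0].get("text") if len(steps) > 0 else None,
        "image_urls": steps[1].get("urls", []) if len(steps) > 1 else [],
        "video_url": first.get("video_url"),
        "audio_url": first.get("audio_url"),
    }


# Made-up response for illustration; real URLs will differ.
sample_result = {
    "steps": [
        {"text": "Scene 1: a crashed satellite in a meadow."},
        {"urls": ["/outputs/scene1.png", "/outputs/scene2.png"]},
    ],
    "data": [{"video_url": "/outputs/story.mp4"}],
}
summary = summarize_story(sample_result)
```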
### Workflow 3: Multilingual Video Dubbing
1. Upload or encode the source video as a data URL.
2. Call the video dub pipeline.
3. Poll `/v1/video/progress` if needed.
4. Download output from returned URL.
```bash
curl -s "$CODERAI_URL/v1/pipelines/video-dub" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model":"whisper-large-v3",
"video":"data:video/mp4;base64,...",
"source_lang":"en",
"target_lang":"ja",
"voice_clone":true,
"burn_subtitles":true,
"response_format":"url"
}' | jq
```
For lower-level control, use:
- `POST /v1/video/subtitle`
- `POST /v1/audio/clone`
- `POST /v1/video/dub`
### Workflow 4: Audio Meeting Summary
Transcribe a meeting and summarize action items with a text model.
```bash
curl -s "$CODERAI_URL/v1/pipelines/audio-understand" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"audio":"data:audio/wav;base64,...",
"audio_model":"whisper-large-v3",
"text_model":"Qwen/Qwen3-8B",
"language":"en",
"input":"Extract decisions, owners, deadlines, and unresolved questions."
}' | jq
```
### Workflow 5: Train and Apply a Character LoRA
1. Build a character profile:
```json
{
"name": "Mira",
"description": "Explorer with copper curls and a green field jacket",
"images": [{"label": "front", "data": "data:image/png;base64,..."}]
}
```
2. Train LoRA:
```bash
curl -s "$CODERAI_URL/v1/loras/train" \
-H "Authorization: Bearer $CODERAI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"name":"mira_lora",
"base_model":"image-model",
"target":"image",
"character":"Mira",
"instance_prompt":"photo of mira_person",
"steps":800,
"rank":16,
"resolution":768
}' | jq
```
3. Poll progress:
```bash
watch -n 2 "curl -s '$CODERAI_URL/v1/loras/progress' -H 'Authorization: Bearer $CODERAI_TOKEN' | jq"
```
4. Generate with LoRA:
```json
{
"model": "image-model",
"prompt": "photo of mira_person exploring alien ruins, cinematic backlight",
"loras": [{"model": "mira_lora", "weight": 0.8}],
"response_format": "url"
}
```
### Workflow 6: Custom Pipeline for Automated Media Asset Creation
Create a reusable pipeline that converts a product idea into a slogan, hero image, promo video, and voiceover.
```bash
curl -s "$CODERAI_URL/v1/pipelines/custom" \
  -H "Authorization: Bearer $CODERAI_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "id":"product-media-kit",
    "name":"Product Media Kit",
    "description":"Slogan, image, video, and voiceover for a product concept.",
    "steps":[
      {
        "type":"text_gen",
        "label":"Write slogan and image prompt",
        "params":{
          "model":"Qwen/Qwen3-8B",
          "system":"Return a concise slogan, then a vivid image prompt.",
          "prompt":"Product concept: {{input}}"
        }
      },
      {
        "type":"image_gen",
        "label":"Hero image",
        "params":{
          "model":"image-model",
          "prompt":"{{step0.output}}",
          "size":"1024x1024",
          "response_format":"url"
        }
      },
      {
        "type":"video_gen",
        "label":"Promo animation",
        "params":{
          "model":"video-model",
          "mode":"i2v",
          "prompt":"premium product commercial, elegant camera motion, {{step0.output}}",
          "init_image":"{{step1.url}}",
          "num_frames":32,
          "fps":8,
          "response_format":"url"
        }
      },
      {
        "type":"tts",
        "label":"Voiceover",
        "params":{
          "model":"kokoro",
          "input":"{{step0.output}}",
          "voice":"af_sarah",
          "speed":1.0
        },
        "continue_on_error":true
      }
    ]
  }' | jq

curl -s "$CODERAI_URL/v1/pipelines/custom/product-media-kit/run" \
  -H "Authorization: Bearer $CODERAI_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"input":"A compact solar charger for hikers and emergency kits"}' | jq
```
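The step parameters reference earlier results through `{{input}}`, `{{step0.output}}`, and `{{step1.url}}`. The following is a client-side sketch of how such placeholders could be resolved, which can help when previewing prompts before registering a pipeline. It is an assumption about the template semantics inferred from the example above, not the server's implementation; `render_template` and the sample `context` values are hypothetical.

```python
import re

# Matches {{name}} and dotted paths such as {{step0.output}}.
_PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z0-9_.]+)\s*\}\}")


def render_template(text: str, context: dict) -> str:
    """Replace {{name}} or {{stepN.field}} with values looked up in `context`.

    Unknown placeholders are left untouched so that mistakes stay visible.
    """
    def lookup(match: re.Match) -> str:
        value = context
        for part in match.group(1).split("."):
            if not isinstance(value, dict) or part not in value:
                return match.group(0)
            value = value[part]
        return str(value)

    return _PLACEHOLDER.sub(lookup, text)


# Illustrative values only; real step outputs come from the server.
context = {
    "input": "A compact solar charger",
    "step0": {"output": "Power anywhere."},
    "step1": {"url": "/v1/files/hero.png"},
}
preview = render_template("premium product commercial, {{step0.output}}", context)
```

Leaving unresolved placeholders in place is a deliberate choice in this sketch: a typo such as `{{step9.url}}` stays visible in the preview instead of silently becoming an empty string.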
## Practical Client Patterns
### Polling Progress While a Job Runs
Use a second terminal while a generation request is running:
```bash
while true; do
  curl -s "$CODERAI_URL/v1/video/progress" \
    -H "Authorization: Bearer $CODERAI_TOKEN" | jq -c
  sleep 2
done
```
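The shell loop above runs until interrupted. A bounded variant in Python might look like the sketch below. The schema of the progress response is not shown in this section, so the completion test is supplied by the caller; `poll_until`, `fetch`, and `is_done` are hypothetical names introduced for illustration.

```python
import time
from typing import Callable


def poll_until(
    fetch: Callable[[], dict],
    is_done: Callable[[dict], bool],
    interval: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call `fetch` every `interval` seconds until `is_done(snapshot)` is true.

    Raises TimeoutError after `max_polls` attempts so a stuck job cannot
    block the client forever.
    """
    for _ in range(max_polls):
        snapshot = fetch()
        if is_done(snapshot):
            return snapshot
        sleep(interval)
    raise TimeoutError(f"job did not finish after {max_polls} polls")
```

Here `fetch` would wrap a `GET` on a progress route such as `/v1/video/progress`, and `is_done` would inspect whichever field the server uses to signal completion.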
### Python Chat Client
```python
import requests

base = "http://127.0.0.1:8776"
token = "your-api-token"

resp = requests.post(
    f"{base}/v1/chat/completions",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "model": "Qwen/Qwen3-8B",
        "messages": [{"role": "user", "content": "Write a CLI release note."}],
        "temperature": 0.3,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```
### Python Streaming Chat Client
```python
import json
import requests

base = "http://127.0.0.1:8776"
token = "your-api-token"

with requests.post(
    f"{base}/v1/chat/completions",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "model": "Qwen/Qwen3-8B",
        "messages": [{"role": "user", "content": "Count to five slowly."}],
        "stream": True,
    },
    stream=True,
    timeout=300,
) as r:
    r.raise_for_status()
    # requests falls back to ISO-8859-1 for text/* responses that declare no
    # charset; force UTF-8 so non-ASCII tokens decode correctly.
    r.encoding = "utf-8"
    for line in r.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        payload = line[6:]
        if payload == "[DONE]":
            break
        event = json.loads(payload)
        delta = event["choices"][0].get("delta", {})
        # `content` can be absent or null (for example in a role-only chunk).
        print(delta.get("content") or "", end="", flush=True)
```
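The parsing step of the streaming client can be separated from the HTTP call so it is testable without a server. The sketch below assumes the stream uses the `data: <json>` lines and `[DONE]` sentinel handled above; `iter_stream_content` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from the `data: ` lines of a streamed chat completion."""
    for line in lines:
        if not line or not line.startswith("data: "):
            continue  # skip blank separators and non-data lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        event = json.loads(payload)
        choices = event.get("choices") or []
        if not choices:
            continue  # tolerate chunks that carry no choices
        content = (choices[0].get("delta") or {}).get("content")
        if content:
            yield content
```

With a live response, the lines would come from `r.iter_lines(decode_unicode=True)`, for example `for piece in iter_stream_content(r.iter_lines(decode_unicode=True)): print(piece, end="", flush=True)`.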
### OpenAI Python SDK Compatibility
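The OpenAI Python SDK can be used for the OpenAI-compatible text routes by pointing `base_url` at the server's `/v1` prefix and passing the API token as `api_key`: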
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://127.0.0.1:8776/v1",
    api_key="your-api-token",
)
response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Explain local model routing."}],
)
print(response.choices[0].message.content)
```
## Endpoint Index
### Public `/v1` and Discovery
| Method | Path |
|---|---|
| `GET` | `/v1/models` |
| `GET` | `/coderai/capabilities` |
| `GET` | `/v1/files/{filename}` |
| `GET` | `/v1/archive` |
| `DELETE` | `/v1/archive/{filename}` |
| `POST` | `/v1/chat/completions` |
| `POST` | `/v1/completions` |
| `GET` | `/v1/images/progress` |
| `POST` | `/v1/images/generations` |
| `POST` | `/v1/images/edits` |
| `POST` | `/v1/images/inpaint` |
| `POST` | `/v1/images/upscale` |
| `POST` | `/v1/images/depth` |
| `POST` | `/v1/images/segment` |
| `POST` | `/v1/images/deblur` |
| `POST` | `/v1/images/unpixelate` |
| `POST` | `/v1/images/outfit` |
| `POST` | `/v1/images/faceswap` |
| `GET` | `/v1/video/progress` |
| `POST` | `/v1/video/generations` |
| `POST` | `/v1/video/upscale` |
| `POST` | `/v1/video/subtitle` |
| `POST` | `/v1/video/interpolate` |
| `POST` | `/v1/video/dub` |
| `POST` | `/v1/audio/transcriptions` |
| `POST` | `/v1/audio/speech` |
| `GET` | `/v1/audio/progress` |
| `POST` | `/v1/audio/generate` |
| `GET` | `/v1/audio/voices` |
| `POST` | `/v1/audio/voices` |
| `GET` | `/v1/audio/voices/{name}` |
| `PATCH` | `/v1/audio/voices/{name}` |
| `DELETE` | `/v1/audio/voices/{name}` |
| `POST` | `/v1/audio/voices/extract` |
| `POST` | `/v1/audio/clone` |
| `POST` | `/v1/audio/convert` |
| `POST` | `/v1/audio/stems` |
| `POST` | `/v1/audio/cleanup` |
| `POST` | `/v1/embeddings` |
| `POST` | `/v1/characters` |
| `GET` | `/v1/characters` |
| `GET` | `/v1/characters/{name}` |
| `PATCH` | `/v1/characters/{name}` |
| `DELETE` | `/v1/characters/{name}` |
| `POST` | `/v1/characters/generate` |
| `POST` | `/v1/characters/extract` |
| `POST` | `/v1/environments` |
| `GET` | `/v1/environments` |
| `GET` | `/v1/environments/{name}` |
| `PATCH` | `/v1/environments/{name}` |
| `DELETE` | `/v1/environments/{name}` |
| `POST` | `/v1/environments/generate` |
| `POST` | `/v1/environments/extract` |
| `POST` | `/v1/loras/train` |
| `GET` | `/v1/loras/progress` |
| `GET` | `/v1/loras` |
| `GET` | `/v1/loras/{name}` |
| `DELETE` | `/v1/loras/{name}` |
| `POST` | `/v1/images/to3d` |
| `POST` | `/v1/images/from3d` |
| `POST` | `/v1/video/to3d` |
| `POST` | `/v1/video/from3d` |
| `POST` | `/v1/3d/generate` |
| `POST` | `/v1/pipelines/image-to-video` |
| `POST` | `/v1/pipelines/video-dub` |
| `POST` | `/v1/pipelines/story` |
| `POST` | `/v1/pipelines/audio-dub` |
| `GET` | `/v1/pipelines/custom` |
| `GET` | `/v1/pipelines/step-types` |
| `POST` | `/v1/pipelines/custom` |
| `PUT` | `/v1/pipelines/custom/{pipeline_id}` |
| `DELETE` | `/v1/pipelines/custom/{pipeline_id}` |
| `POST` | `/v1/pipelines/custom/{pipeline_id}/run` |
| `POST` | `/v1/pipelines/run` |
| `POST` | `/v1/pipelines/audio-understand` |
| `POST` | `/v1/pipelines/audio-music-dub` |
### Admin API
| Method | Path |
|---|---|
| `GET` | `/admin/api/status` |
| `POST` | `/admin/api/users` |
| `DELETE` | `/admin/api/users/{user_id}` |
| `GET` | `/admin/api/tokens` |
| `POST` | `/admin/api/tokens` |
| `DELETE` | `/admin/api/tokens/{token_id}` |
| `GET` | `/admin/api/models` |
| `POST` | `/admin/api/model-download` |
| `GET` | `/admin/api/download-stream/{session_id}` |
| `GET` | `/admin/api/downloads` |
| `POST` | `/admin/api/download-cancel/{session_id}` |
| `POST` | `/admin/api/model-upload` |
| `DELETE` | `/admin/api/models/{model_identifier}` |
| `GET` | `/admin/api/hf-files` |
| `GET` | `/admin/api/cached-models` |
| `GET` | `/admin/api/cache-stats` |
| `DELETE` | `/admin/api/cache` |
| `DELETE` | `/admin/api/cached-models/{model_id:path}` |
| `POST` | `/admin/api/model-enable` |
| `POST` | `/admin/api/model-disable` |
| `GET` | `/admin/api/model-loaded-status` |
| `POST` | `/admin/api/model-load` |
| `POST` | `/admin/api/model-unload` |
| `POST` | `/admin/api/model-configure` |
| `POST` | `/admin/api/system/reload` |
| `GET` | `/admin/api/settings` |
| `POST` | `/admin/api/settings` |
| `GET` | `/admin/api/archive` |
| `GET` | `/admin/api/archive/{gen_id}` |
| `DELETE` | `/admin/api/archive/{gen_id}` |
| `GET` | `/admin/api/archive/{gen_id}/files/{filename}` |
| `GET` | `/admin/api/archive-settings` |
| `GET` | `/admin/api/hf-search` |
| `GET` | `/admin/api/hf-model-files` |
| `GET` | `/admin/api/hf-model-info` |
| `GET` | `/admin/api/characters` |
| `GET` | `/admin/api/characters/{name}` |
| `GET` | `/admin/api/characters/{name}/thumbnail` |
| `DELETE` | `/admin/api/characters/{name}` |
| `GET` | `/admin/api/environments` |
| `GET` | `/admin/api/environments/{name}` |
| `GET` | `/admin/api/environments/{name}/thumbnail` |
| `DELETE` | `/admin/api/environments/{name}` |
| `GET` | `/admin/api/voices` |
| `GET` | `/admin/api/voices/{name}` |
| `DELETE` | `/admin/api/voices/{name}` |
......@@ -173,6 +173,16 @@ if [ "$BACKEND" = "nvidia" ]; then
echo -e "${YELLOW}Note: audiocraft not installed (audio generation with MusicGen optional)${NC}"
}
# Optional quantization backends for diffusers image/video pipelines:
# optimum-quanto -> enables 2-bit (int2) per-component quantization
# gguf -> enables loading GGUF-quantized components (Q5_K/Q6_K, etc.)
# bitsandbytes (4-bit/8-bit) comes via requirements-nvidia.txt; these add the
# extra widths that bitsandbytes cannot do.
echo -e "${YELLOW}Installing optional quantization backends (2-bit / GGUF)...${NC}"
pip install optimum-quanto gguf || {
    echo -e "${YELLOW}Note: optimum-quanto/gguf not installed (2-bit and GGUF 5/6-bit quantization optional)${NC}"
}
# Install Flash Attention 2 if requested
if [ "$FLASH" = true ]; then
echo ""
......
......@@ -14,6 +14,63 @@
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
# Configure the CUDA caching allocator BEFORE torch is imported anywhere.
# expandable_segments lets the allocator return freed pages to the driver even
# from partially-used segments. Without it, a single small live tensor (e.g. a
# tied embedding weight) pins an entire large segment, so torch.cuda.empty_cache()
# cannot release the GBs of already-freed weights around it after a model is
# evicted — VRAM stays occupied and the next model can't load. Honour any value
# the user already set.
import os as _os
_alloc_conf = _os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
if "expandable_segments" not in _alloc_conf:
    _os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
        (_alloc_conf + ",") if _alloc_conf else ""
    ) + "expandable_segments:True"
# Cap CPU threads BEFORE torch / OpenMP / MKL initialise. Loading and 4-bit
# dequantising large models is CPU-heavy; left uncapped, torch/OpenMP grab every
# core and the machine's load average spikes and it becomes sluggish. On boxes
# with >= 8 cores, limit to HALF the cores so model loads never saturate the
# machine. Smaller machines keep the default (don't cripple them). Honour any
# value the user already set.
try:
    _ncpu = _os.cpu_count() or 0
    if _ncpu >= 8:
        _cap = str(max(1, _ncpu // 2))
        for _var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS",
                     "NUMEXPR_NUM_THREADS", "VECLIB_MAXIMUM_THREADS"):
            _os.environ.setdefault(_var, _cap)
except Exception:
    pass
# Silence ONE specific upstream FutureWarning from bitsandbytes' quant kernels:
# bitsandbytes/backends/cuda/ops.py: torch._check_is_size(blocksize)
# bitsandbytes (latest, 0.49.2) still calls the deprecated torch._check_is_size
# on bleeding-edge torch. We don't call it ourselves and can't fix their source,
# so suppress just this message (not warnings in general) to keep logs readable.
import warnings as _warnings
_warnings.filterwarnings(
    "ignore",
    message=r".*_check_is_size will be removed.*",
    category=FutureWarning,
)
# More upstream / diagnostic-only noise we can't fix from here:
# - huggingface_hub: diffusers/transformers pass the deprecated
# `local_dir_use_symlinks` kwarg to hf_hub_download (not our code).
# - torch.distributed.reduce_op: emitted while the debug leak-scanner walks
# gc.get_objects(); unavoidable without dropping the scan.
_warnings.filterwarnings(
    "ignore",
    message=r".*local_dir_use_symlinks.*",
    category=UserWarning,
)
_warnings.filterwarnings(
    "ignore",
    message=r".*reduce_op.*is deprecated.*",
    category=FutureWarning,
)
# codai module - AI model parsing utilities
from .models.parser import (
ModelParserDispatcher,
......
......@@ -15,8 +15,9 @@
.sidebar {
width:220px; min-width:180px; background:var(--surface-1);
border-right:1px solid var(--border); display:flex; flex-direction:column;
overflow:hidden; flex-shrink:0;
overflow:hidden; flex-shrink:0; transition:width .15s, min-width .15s;
}
.sidebar.hidden { width:0; min-width:0; border-right:none; overflow:hidden; }
.sidebar-hd { padding:.6rem 1rem .15rem; font-size:10px; font-weight:700;
color:var(--text-3); letter-spacing:.07em; text-transform:uppercase; }
.model-list { flex:1; overflow-y:auto; padding:.2rem .4rem .5rem; }
......@@ -242,6 +243,52 @@ a.dl { display:inline-block; margin-top:.4rem; }
.req-preview-actions { display:flex; gap:.4rem; flex-wrap:wrap; align-items:center; }
.req-preview-status { font-size:11px; color:var(--text-3); min-height:14px; }
/* ── Model pick block (Studio per-panel selectors) ───────────── */
.model-pick-block {
background:var(--surface-2); border:1px solid var(--border); border-radius:8px;
padding:.6rem .75rem; display:flex; flex-direction:column; gap:.4rem; margin-bottom:.6rem;
}
.model-pick-title {
font-size:10px; font-weight:700; letter-spacing:.07em; text-transform:uppercase;
color:var(--text-3); margin-bottom:.1rem;
}
.model-pick-row { display:flex; align-items:center; gap:.5rem; }
.model-pick-role {
font-size:11px; color:var(--text-2); min-width:7.5rem; flex-shrink:0; line-height:1.3;
}
.model-pick-sel {
flex:1; padding:.35rem .5rem; border:1px solid var(--border); border-radius:6px;
background:var(--surface-1); color:var(--text-1); font-size:12px; cursor:pointer;
min-width:0;
}
.model-pick-sel:focus { outline:2px solid var(--accent); outline-offset:1px; }
.model-pick-hint {
font-size:10px; color:var(--text-3); display:flex; align-items:center; gap:.55rem; flex-wrap:wrap;
}
.mp-ok { color:#4ade80; }
.mp-warn { color:#f0c060; }
/* VAE / LoRA optional section */
.mp-extra { border-top:1px solid var(--border); margin-top:.3rem; padding-top:.4rem; }
.mp-extra summary {
font-size:11px; color:var(--text-2); cursor:pointer; user-select:none;
list-style:none; display:flex; align-items:center; gap:.3rem;
}
.mp-extra summary::-webkit-details-marker { display:none; }
.mp-extra summary::before { content:'▶'; font-size:9px; transition:transform .15s; }
.mp-extra[open] summary::before { transform:rotate(90deg); }
.mp-extra-body { display:flex; flex-direction:column; gap:.4rem; margin-top:.45rem; }
.lora-entry { display:flex; align-items:center; gap:.35rem; }
.lora-weight {
width:4.5rem; flex-shrink:0; padding:.3rem .4rem; border:1px solid var(--border);
border-radius:5px; background:var(--surface-1); color:var(--text-1); font-size:12px;
}
.lora-remove {
width:1.5rem; height:1.5rem; flex-shrink:0; display:flex; align-items:center; justify-content:center;
border:none; background:transparent; color:var(--text-3); cursor:pointer; font-size:13px;
line-height:1; padding:0; border-radius:4px;
}
.lora-remove:hover { background:var(--surface-3); color:var(--text-1); }
/* ── Diagnostics / history ────────────────────────────────────── */
.diag-card, .hist-card {
border:1px solid var(--border); background:var(--surface-1); border-radius:8px;
......@@ -358,6 +405,19 @@ a.dl { display:inline-block; margin-top:.4rem; }
.prof-voice-actions { display:flex; gap:.4rem; margin-top:.5rem; }
/* ── Role-picker popup ────────────────────────────────────────── */
.role-picker-popup { position:fixed; z-index:9999; background:var(--surface-1); border:1px solid var(--border); border-radius:8px; padding:.7rem; box-shadow:0 8px 24px rgba(0,0,0,.5); min-width:200px; max-width:280px; }
/* ── Profile viewer modal ─────────────────────────────────────── */
.prof-modal-backdrop { position:fixed; inset:0; background:rgba(0,0,0,.6); z-index:10000; display:flex; align-items:center; justify-content:center; }
.prof-modal { background:var(--surface-1); border:1px solid var(--border); border-radius:10px; box-shadow:0 12px 40px rgba(0,0,0,.6); width:min(700px,95vw); max-height:85vh; display:flex; flex-direction:column; overflow:hidden; }
.prof-modal-hd { display:flex; align-items:center; gap:.6rem; padding:.8rem 1rem; border-bottom:1px solid var(--border); flex-shrink:0; }
.prof-modal-hd h3 { margin:0; font-size:15px; flex:1; color:var(--text-1); }
.prof-modal-body { padding:1rem; overflow-y:auto; flex:1; }
.prof-modal-desc { font-size:13px; color:var(--text-2); margin-bottom:.8rem; }
.prof-modal-imgs { display:flex; flex-wrap:wrap; gap:.5rem; }
.prof-modal-imgs img { height:140px; width:140px; object-fit:cover; border-radius:6px; cursor:pointer; border:2px solid transparent; transition:border-color .15s; }
.prof-modal-imgs img:hover { border-color:var(--accent,#4f8ef7); }
.prof-modal-empty { color:var(--text-3); font-size:13px; font-style:italic; }
.prof-lightbox { position:fixed; inset:0; background:rgba(0,0,0,.88); z-index:11000; display:flex; align-items:center; justify-content:center; cursor:zoom-out; }
.prof-lightbox img { max-width:90vw; max-height:90vh; object-fit:contain; border-radius:6px; box-shadow:0 8px 40px rgba(0,0,0,.8); }
.role-picker-header { font-size:12px; color:var(--text-2); margin-bottom:.5rem; }
.role-picker-caps { display:flex; flex-direction:column; gap:.3rem; }
.role-pick-btn { background:var(--surface-2); border:1px solid var(--border); border-radius:5px; color:var(--text-1); padding:.35rem .6rem; font-size:12px; cursor:pointer; font-family:inherit; text-align:left; display:flex; align-items:center; justify-content:space-between; gap:.4rem; }
......@@ -2635,6 +2695,188 @@ function getCapabilityDetails(sub) {
};
}
// ── Model pick block state ───────────────────────────────────────────────
let _mpVae = {}; // { sub: string } — VAE override per sub
let _mpLoras = {}; // { sub: [{model, weight, name}] }
const _MP_ROLE_LABELS = {
image_generation:'Image generation', image_to_image:'Image editing',
inpainting:'Inpainting', image_upscaling:'Upscaler', depth_estimation:'Depth estimator',
image_segmentation:'Segmentation', video_generation:'Video generation',
image_to_video:'Image → video', video_to_video:'Video editing',
video_interpolation:'Frame interpolation', video_upscaling:'Video upscaler',
subtitle_generation:'Subtitles', speech_to_text:'Transcription',
text_to_speech:'Voice synthesis', audio_generation:'Music / SFX',
audio_to_audio:'Voice conversion', embeddings:'Embedding model',
text_generation:'Language model', image_to_text:'Vision model',
};
const _IMAGE_SUBS = new Set(['img-gen','img-edit','img-inpaint','img-upscale','img-depth',
'img-seg','img-outfit','img-faceswap','img-deblur','img-unpix','img-to3d','img-from3d']);
const _VIDEO_SUBS_VL = new Set(['vid-t2v','vid-i2v','vid-v2v','vid-ti2v']);
function _mpBuildOpts(sub, cap) {
const assigned = capModelAssignments[sub]?.[cap] || activeModel?.id || '';
const capable = modelsForCap(cap);
if (!capable.length) {
return `<option value="">— no compatible model configured —</option>`;
}
let opts = `<option value="">— select —</option>`;
capable.forEach(m => {
const lbl = m.id.split('/').pop() + (m.load_mode === 'load' ? ' ●' : '');
opts += `<option value="${escapeHtml(m.id)}"${m.id === assigned ? ' selected' : ''}>${escapeHtml(lbl)}</option>`;
});
return opts;
}
function _mpBuildComponentOpts(selected, pattern) {
const list = models.filter(m => pattern.test(m.id));
if (!list.length) return null;
let opts = `<option value="">— none —</option>`;
list.forEach(m => {
const lbl = m.id.split('/').pop() + (m.load_mode === 'load' ? ' ●' : '');
opts += `<option value="${escapeHtml(m.id)}"${m.id === selected ? ' selected' : ''}>${escapeHtml(lbl)}</option>`;
});
return opts;
}
function _mpLoraEntry(sub, i, lora) {
const opts = _mpBuildComponentOpts(lora.model || '', /lora/i);
const sel = opts
? `<select class="model-pick-sel" onchange="_mpLoraChange('${sub}',${i},'model',this.value)">${opts}</select>`
: `<span class="model-pick-sel" style="color:var(--text-3);font-size:11px;display:flex;align-items:center">No LoRA models configured</span>`;
return `<div class="lora-entry" id="mp-lora-${sub}-${i}">
${sel}
${opts ? `<input type="number" class="lora-weight" value="${lora.weight??1}" step="0.05" min="0" max="2"
title="Weight" onchange="_mpLoraChange('${sub}',${i},'weight',parseFloat(this.value))">` : ''}
<button class="lora-remove" onclick="_mpLoraRemove('${sub}',${i})" title="Remove">✕</button>
</div>`;
}
function renderModelPickBlock(sub) {
const studioRule = STUDIO_CAPABILITIES[sub];
const subRule = SUB_CAPABILITY_RULES[sub];
const isMulti = studioRule && (studioRule.requires || []).length > 1;
const showVaeLora = _IMAGE_SUBS.has(sub) || _VIDEO_SUBS_VL.has(sub);
// Build the list of caps to show selectors for
let caps = [];
if (studioRule) {
const req = (studioRule.requires || []).filter(c => modelsForCap(c).length > 0);
const opt = (studioRule.optional || []).filter(c => modelsForCap(c).length > 0);
caps = req.map(c=>({cap:c,required:true})).concat(opt.map(c=>({cap:c,required:false})));
} else {
const primary = SUB_API_CAP[sub] || (subRule?.requiresAny||subRule?.optional||[])[0];
if (primary) caps = [{cap:primary, required:true}];
}
let rows = '';
if (caps.length === 0) {
rows = `<div class="model-pick-row" style="font-size:12px;color:var(--text-3)">Auto — no explicit capability required.</div>`;
} else {
rows = caps.map(({cap, required}) => {
const roleLabel = _MP_ROLE_LABELS[cap] || cap.replace(/_/g,' ');
const opts = _mpBuildOpts(sub, cap);
const selId = `mpsel-${sub.replace(/-/g,'_')}-${cap}`;
const optLabel = !required && isMulti ? ` <em style="font-weight:400;opacity:.65">(opt)</em>` : '';
return `<div class="model-pick-row">
${isMulti || caps.length > 1 ? `<span class="model-pick-role">${escapeHtml(roleLabel)}${optLabel}</span>` : ''}
<select class="model-pick-sel" id="${selId}" onchange="_mpSelChange('${sub}','${cap}',this.value)">
${opts}
</select>
</div>`;
}).join('');
}
// VAE / LoRA section
let vaeLoraSec = '';
if (showVaeLora) {
const vaeOpts = _mpBuildComponentOpts(_mpVae[sub] || '', /vae/i);
const hasLora = models.some(m => /lora/i.test(m.id));
const loraListHtml = (_mpLoras[sub] || []).map((l,i) => _mpLoraEntry(sub, i, l)).join('');
const hasAnyComponent = vaeOpts || hasLora;
if (hasAnyComponent) {
const vaeRow = vaeOpts
? `<div class="model-pick-row">
<span class="model-pick-role">VAE</span>
<select class="model-pick-sel" id="mpvae-${sub.replace(/-/g,'_')}" onchange="_mpVaeChange('${sub}',this.value)">
${vaeOpts}
</select>
</div>`
: '';
const loraSection = hasLora
? `<div class="model-pick-role" style="min-width:0;font-size:10px;color:var(--text-3);margin-top:.15rem">LoRA</div>
<div id="mp-loras-${sub}">${loraListHtml}</div>
<button class="btn btn-ghost btn-sm" onclick="_mpLoraAdd('${sub}')" style="align-self:flex-start;font-size:11px">+ Add LoRA</button>`
: '';
vaeLoraSec = `<div class="mp-extra"><details>
<summary>VAE / LoRA <em style="font-weight:400;opacity:.65">(optional overrides)</em></summary>
<div class="mp-extra-body">
${vaeRow}
${loraSection}
</div>
</details></div>`;
}
}
return `<div class="model-pick-block">
<div class="model-pick-title">${isMulti || caps.length > 1 ? 'Models' : 'Model'}</div>
${rows}
${caps.length > 0 ? `<div class="model-pick-hint"><span style="opacity:.6">● loaded in VRAM</span></div>` : ''}
${vaeLoraSec}
</div>`;
}
function _mpSelChange(sub, cap, modelId) {
if (!modelId) return;
const m = models.find(m => m.id === modelId);
if (!m) return;
if (cap) assignModelToCap(sub, cap, m);
else selectSubModel(sub, m);
}
function _mpVaeChange(sub, value) { _mpVae[sub] = value || null; }
function _mpLoraChange(sub, i, field, value) {
if (!_mpLoras[sub]) _mpLoras[sub] = [];
if (!_mpLoras[sub][i]) _mpLoras[sub][i] = {model:'', weight:1.0};
_mpLoras[sub][i][field] = value;
}
function _mpLoraAdd(sub) {
if (!_mpLoras[sub]) _mpLoras[sub] = [];
_mpLoras[sub].push({model:'', weight:1.0});
const el = document.getElementById(`mp-loras-${sub}`);
if (el) el.innerHTML = _mpLoras[sub].map((l,i) => _mpLoraEntry(sub, i, l)).join('');
}
function _mpLoraRemove(sub, idx) {
if (_mpLoras[sub]) {
_mpLoras[sub].splice(idx, 1);
const el = document.getElementById(`mp-loras-${sub}`);
if (el) el.innerHTML = _mpLoras[sub].map((l,i) => _mpLoraEntry(sub, i, l)).join('');
}
}
function _mpSyncSelect(sub, cap, modelId) {
const selId = `mpsel-${sub.replace(/-/g,'_')}-${cap}`;
const el = document.getElementById(selId);
if (el && modelId) el.value = modelId;
}
function getVaeForSub(sub) { return _mpVae[sub] || null; }
function getLorasForSub(sub) {
return (_mpLoras[sub] || [])
.filter(l => l.model)
.map((l, i) => ({
model: l.model,
weight: l.weight ?? 1.0,
name: l.name || l.model.split('/').pop().replace(/[^a-zA-Z0-9_-]/g,'_'),
}));
}
// ── end model pick block ─────────────────────────────────────────────────
function renderCapabilityCard(sub) {
const shell = $(`cap-${sub}`);
if (!shell) return;
......@@ -2667,9 +2909,9 @@ function renderCapabilityCard(sub) {
<span class="cap-chip">${details.backendPath}</span>
<span class="cap-chip">${details.io}</span>
</div>
${renderModelPickBlock(sub)}
${missingBits.join('')}
${notes}
${renderSubModelPicker(sub)}
`;
}
......@@ -2681,16 +2923,11 @@ function renderCapabilityCards() {
const shell = $(`cap-${sub}`);
if (!shell) return;
const state = currentTabState.subs[sub] || 'unavailable';
const picker = renderSubModelPicker(sub);
const picker = renderModelPickBlock(sub);
if (state === 'available') {
if (picker) {
shell.style.display = '';
shell.classList.remove('state-partial', 'state-unavailable');
shell.innerHTML = picker;
} else {
shell.style.display = 'none';
shell.innerHTML = '';
}
shell.style.display = '';
shell.classList.remove('state-partial', 'state-unavailable');
shell.innerHTML = picker;
return;
}
shell.style.display = '';
......@@ -2711,8 +2948,8 @@ function renderCapabilityCards() {
<div class="cap-card-title">${escapeHtml(label)}</div>
<span class="cap-chip${availabilityClass}">${availabilityLabel}</span>
</div>
${missingBits.join('')}
${picker}
${missingBits.join('')}
`;
});
renderAudioBackendHealth();
......@@ -2966,7 +3203,7 @@ function previewExportBody(endpoint, body) {
function buildAudioPreviewData() {
return previewExportBody(ROOT_PATH + '/v1/audio/generate', {
model: activeModel?.id || '',
model: modelForSub('aud-gen') || '',
prompt: val('ag-prompt'),
duration: fval('ag-dur') || 10,
temperature: fval('ag-temp') || 1.0,
......@@ -2980,7 +3217,7 @@ function buildAudioPreviewData() {
function buildTTSPreviewData() {
return previewExportBody(ROOT_PATH + '/v1/audio/speech', {
model: activeModel?.id || '',
model: modelForSub('aud-tts') || '',
input: val('at-text'),
voice: val('at-voice') || undefined,
speed: fval('at-speed') || 1.0,
......@@ -3000,8 +3237,10 @@ function buildSTTPreviewData() {
}
function buildImageGenPreviewData() {
const loras = getLorasForSub('img-gen');
const vae = getVaeForSub('img-gen');
return previewExportBody(ROOT_PATH + '/v1/images/generations', {
model: activeModel?.id || '',
model: modelForSub('img-gen') || '',
prompt: val('ig-prompt'),
negative_prompt: val('ig-neg') || undefined,
size: `${ival('ig-w') || 1024}x${ival('ig-h') || 1024}`,
......@@ -3011,6 +3250,8 @@ function buildImageGenPreviewData() {
n: ival('ig-n') || 1,
response_format: 'url',
safety_checker: chk('ig-nosafe') ? false : undefined,
...(vae ? {vae_model: vae} : {}),
...(loras.length ? {loras} : {}),
});
}
......@@ -3018,7 +3259,7 @@ function buildEmbeddingsPreviewData() {
const lines = val('em-text').split('\n').filter(l => l.trim());
const input = lines.length <= 1 ? (lines[0] || '') : lines;
return previewExportBody(ROOT_PATH + '/v1/embeddings', {
model: activeModel?.id || '',
model: modelForSub('embed') || '',
input,
encoding_format: val('em-enc') || 'float',
dimensions: val('em-dims') ? ival('em-dims') : undefined,
......@@ -3190,6 +3431,8 @@ const REQUEST_PREVIEW_CONFIG = {
{ label:'Steps', value:preview => preview.body.steps },
{ label:'CFG', value:preview => preview.body.guidance_scale },
{ label:'Count', value:preview => preview.body.n },
{ label:'VAE', value:preview => preview.body.vae_model || '' },
{ label:'LoRA', value:preview => preview.body.loras?.length ? preview.body.loras.map(l=>l.model.split('/').pop()).join(', ') : '' },
],
},
'embed': {
......@@ -3333,6 +3576,7 @@ async function loadModels() {
models = deduplicateModels(d.data || []);
renderSidebar();
if (models.length) selectModel(models[0]);
pcPopulateModelSelect(); pePopulateModelSelect();
} catch(e) {
$('model-list').innerHTML = '<div class="muted small" style="padding:.5rem .6rem">Failed to load models</div>';
}
......@@ -3467,6 +3711,8 @@ function selectCat(cat) {
document.querySelectorAll('.t1btn').forEach(b => b.classList.toggle('active', b.dataset.cat === cat));
const hasL2 = ['image','video','audio','3d','profiles'].includes(cat);
$('tabbar2').classList.toggle('visible', hasL2);
const isChatLike = cat === 'chat' || cat === 'embed';
document.querySelector('.sidebar')?.classList.toggle('hidden', !isChatLike);
if (!hasL2) {
clearSidebarHighlights();
document.querySelectorAll('.panel').forEach(p => p.classList.remove('active'));
......@@ -3480,7 +3726,7 @@ function selectCat(cat) {
btn.dataset.catVisible = belongsHere ? cat : '';
btn.classList.toggle('state-hidden', !belongsHere);
});
if (cat === 'profiles') { profCharLoad(); profEnvLoad(); profVoiceLoad(); }
if (cat === 'profiles') { profCharLoad(); profEnvLoad(); profVoiceLoad(); pcPopulateModelSelect(); pePopulateModelSelect(); }
const activeSub = document.querySelector('.t2btn.active');
const activeSubFits = activeSub && isSubVisibleForCategory(activeSub.dataset.sub, cat);
const nextSub = activeSubFits ? activeSub.dataset.sub : getFirstVisibleSub(cat)?.dataset.sub;
......@@ -3575,6 +3821,8 @@ function assignModelToCap(sub, cap, model) {
document.querySelectorAll('.model-item').forEach(el =>
el.classList.toggle('active', el.dataset.id === model.id));
}
// Sync the select element in the panel (if rendered, avoids full re-render).
if (cap) _mpSyncSelect(sub, cap, model.id);
renderCapabilityCards();
if (SUB_CAT[sub]) highlightSidebarForSub(sub);
}
......@@ -4273,6 +4521,8 @@ async function genImage() {
try {
const igCharProfiles = getCharProfilesList('ig');
const igEnvProfiles = getEnvProfilesList('ig');
const _igLoras = getLorasForSub('img-gen');
const _igVae = getVaeForSub('img-gen');
const d = await post('/v1/images/generations', {
model:modelForSub('img-gen'), prompt:val('ig-prompt'),
size:val('ig-w')+'x'+val('ig-h'),
......@@ -4282,6 +4532,8 @@ async function genImage() {
...(val('ig-neg') ? {negative_prompt:val('ig-neg')} : {}),
disable_safety_checker: chk('ig-nosafe'),
response_format:'url',
...(_igVae ? {vae_model:_igVae} : {}),
...(_igLoras.length ? {loras:_igLoras} : {}),
...(igCharProfiles.length ? {character_profiles:igCharProfiles, character_strength:fval('ig-char-str')||0.6} : {}),
...(igEnvProfiles.length ? {environment_profiles:igEnvProfiles, environment_strength:fval('ig-env-str')||0.6} : {}),
});
......@@ -5633,12 +5885,12 @@ function profCharSubmit() {
function pcPopulateModelSelect() {
const sel = $('pc-gen-model'); if (!sel) return;
const cur = sel.value;
// Collect image-capable models from the cached model list
const opts = ['<option value="">Default image model</option>'];
const opts = ['<option value="">— select model —</option>'];
(models || []).forEach(m => {
const caps = m.capabilities || [];
if (caps.includes('image_generation') || caps.includes('image_to_image')) {
opts.push(`<option value="${escapeHtml(m.id)}">${escapeHtml(m.id)}</option>`);
if (caps.includes('image_generation')) {
const lbl = m.id.split('/').pop() + (m.load_mode === 'load' ? ' ●' : '');
opts.push(`<option value="${escapeHtml(m.id)}">${escapeHtml(lbl)}</option>`);
}
});
sel.innerHTML = opts.join('');
......@@ -5763,11 +6015,43 @@ function renderCharList() {
}).join('');
}
function _openLightbox(src) {
const lb = document.createElement('div');
lb.className = 'prof-lightbox';
lb.innerHTML = `<img src="${escapeHtml(src)}">`;
lb.addEventListener('click', () => lb.remove());
document.body.appendChild(lb);
}
function _openProfModal(title, description, images) {
const existing = document.getElementById('prof-view-modal');
if (existing) existing.remove();
const imgHtml = images.length
? images.map((img, i) => `<img src="${escapeHtml(img.data)}" title="${escapeHtml(img.label||`image ${i+1}`)}" data-src="${escapeHtml(img.data)}" onclick="_openLightbox(this.dataset.src)">`).join('')
: `<div class="prof-modal-empty">No images stored.</div>`;
const backdrop = document.createElement('div');
backdrop.id = 'prof-view-modal';
backdrop.className = 'prof-modal-backdrop';
backdrop.innerHTML = `
<div class="prof-modal">
<div class="prof-modal-hd">
<h3>${escapeHtml(title)}</h3>
<button class="btn btn-ghost btn-sm" onclick="document.getElementById('prof-view-modal').remove()">✕ Close</button>
</div>
<div class="prof-modal-body">
${description ? `<div class="prof-modal-desc">${escapeHtml(description)}</div>` : ''}
<div class="prof-modal-imgs">${imgHtml}</div>
</div>
</div>`;
backdrop.addEventListener('click', e => { if (e.target === backdrop) backdrop.remove(); });
document.body.appendChild(backdrop);
}
async function profCharView(name) {
const d = await fetch(ROOT_PATH + '/admin/api/characters/'+encodeURIComponent(name)).then(r=>r.json());
const imgs = (d.images||[]).map(img=>`<img src="${img.data}" style="height:80px;border-radius:4px;object-fit:cover" title="${escapeHtml(img.label||'')}">`).join('');
alert(`Character: ${d.name}\nDescription: ${d.description||'—'}\nImages: ${d.image_count}\n\n(Images are shown in console; open DevTools to inspect)`);
console.log('[profCharView]', d.name, d);
try {
const d = await fetch(ROOT_PATH + '/admin/api/characters/'+encodeURIComponent(name)).then(r=>r.json());
_openProfModal(`Character: ${d.name}`, d.description||'', d.images||[]);
} catch(e) { alert('Failed to load character: ' + e.message); }
}
async function profCharDelete(name) {
......@@ -5892,11 +6176,12 @@ function profEnvSubmit() {
function pePopulateModelSelect() {
const sel = $('pe-gen-model'); if (!sel) return;
const cur = sel.value;
const opts = ['<option value="">Default image model</option>'];
const opts = ['<option value="">— select model —</option>'];
(models || []).forEach(m => {
const caps = m.capabilities || [];
if (caps.includes('image_generation') || caps.includes('image_to_image')) {
opts.push(`<option value="${escapeHtml(m.id)}">${escapeHtml(m.id)}</option>`);
if (caps.includes('image_generation')) {
const lbl = m.id.split('/').pop() + (m.load_mode === 'load' ? ' ●' : '');
opts.push(`<option value="${escapeHtml(m.id)}">${escapeHtml(lbl)}</option>`);
}
});
sel.innerHTML = opts.join('');
......@@ -6008,9 +6293,10 @@ function renderEnvList() {
}
async function profEnvView(name) {
const d = await fetch(ROOT_PATH + '/admin/api/environments/'+encodeURIComponent(name)).then(r=>r.json());
alert(`Environment: ${d.name}\nDescription: ${d.description||''}\nImages: ${d.image_count}\n\n(Images are shown in console; open DevTools to inspect)`);
console.log('[profEnvView]', d.name, d);
try {
const d = await fetch(ROOT_PATH + '/admin/api/environments/'+encodeURIComponent(name)).then(r=>r.json());
_openProfModal(`Environment: ${d.name}`, d.description||'', d.images||[]);
} catch(e) { alert('Failed to load environment: ' + e.message); }
}
async function profEnvDelete(name) {
......
......@@ -102,6 +102,57 @@
</div>
</div>
<!-- Thermal protection -->
<div class="card mb-0" style="margin-top:1rem">
<div class="card-title">Thermal Protection</div>
<span class="form-hint" style="display:block;margin-bottom:.75rem">
Before serving a request against a loaded model, wait until temperatures are
safe, so that a long sequence of heavy generations can't overheat the machine
and trip its power-off protection. The wait is non-blocking (other requests keep
being accepted). Changes take effect immediately on save. Temperatures are in °C.
</span>
<div class="form-row">
<label style="display:flex;align-items:center;gap:.5rem;cursor:pointer">
<input type="checkbox" id="s-therm-gpu-enabled" onchange="toggleThermalFields()">
<span style="font-size:13px;font-weight:500">Enable GPU temperature protection</span>
</label>
</div>
<div id="therm-gpu-fields" class="form-row" style="display:grid;grid-template-columns:1fr 1fr;gap:1rem">
<div>
<label class="form-label">Pause when GPU reaches (°C)</label>
<input type="number" id="s-therm-gpu-high" class="form-input" min="40" max="120" step="1" placeholder="90">
</div>
<div>
<label class="form-label">Resume when GPU drops to (°C)</label>
<input type="number" id="s-therm-gpu-resume" class="form-input" min="30" max="120" step="1" placeholder="87">
</div>
</div>
<div class="form-row" style="margin-top:.5rem">
<label style="display:flex;align-items:center;gap:.5rem;cursor:pointer">
<input type="checkbox" id="s-therm-cpu-enabled" onchange="toggleThermalFields()">
<span style="font-size:13px;font-weight:500">Enable CPU temperature protection</span>
</label>
</div>
<div id="therm-cpu-fields" class="form-row" style="display:grid;grid-template-columns:1fr 1fr;gap:1rem">
<div>
<label class="form-label">Pause when CPU reaches (°C)</label>
<input type="number" id="s-therm-cpu-high" class="form-input" min="40" max="120" step="1" placeholder="90">
</div>
<div>
<label class="form-label">Resume when CPU drops to (°C)</label>
<input type="number" id="s-therm-cpu-resume" class="form-input" min="30" max="120" step="1" placeholder="87">
</div>
</div>
<div class="form-row" style="margin:0">
<label class="form-label">Re-check interval while cooling down (seconds)</label>
<input type="number" id="s-therm-poll" class="form-input" style="max-width:200px" min="1" max="120" step="1" placeholder="5">
<span class="form-hint">How often to re-read temperatures while waiting for cooldown.</span>
</div>
</div>
<div class="card mb-0" style="margin-top:1rem">
<div class="card-title">AISBF Broker</div>
<div class="form-row">
......@@ -210,6 +261,13 @@ function toggleBrokerFields(){
}
}
function toggleThermalFields(){
document.getElementById('therm-gpu-fields').style.display =
document.getElementById('s-therm-gpu-enabled').checked ? 'grid' : 'none';
document.getElementById('therm-cpu-fields').style.display =
document.getElementById('s-therm-cpu-enabled').checked ? 'grid' : 'none';
}
function showAlert(type, msg){
const el = document.getElementById('settings-alert');
el.className = 'alert alert-' + (type === 'error' ? 'error' : 'info');
......@@ -260,6 +318,16 @@ async function loadSettings(){
document.getElementById('s-broker-reconnect-max').value = broker.reconnect_max_delay_seconds ?? 60;
document.getElementById('s-broker-ws-ping').value = broker.websocket_ping_interval ?? 20;
toggleBrokerFields();
// Thermal protection
const therm = d.thermal || {};
document.getElementById('s-therm-gpu-enabled').checked = therm.gpu_enabled !== false;
document.getElementById('s-therm-cpu-enabled').checked = therm.cpu_enabled !== false;
document.getElementById('s-therm-gpu-high').value = therm.gpu_high ?? 90;
document.getElementById('s-therm-gpu-resume').value = therm.gpu_resume ?? 87;
document.getElementById('s-therm-cpu-high').value = therm.cpu_high ?? 90;
document.getElementById('s-therm-cpu-resume').value = therm.cpu_resume ?? 87;
document.getElementById('s-therm-poll').value = therm.poll_seconds ?? 5;
toggleThermalFields();
}catch(e){ showAlert('error','Failed to load settings: '+e.message); }
}
......@@ -286,6 +354,15 @@ async function saveSettings(){
directory: document.getElementById('s-arc-dir').value.trim(),
retention: document.getElementById('s-arc-retention').value,
},
thermal:{
gpu_enabled: document.getElementById('s-therm-gpu-enabled').checked,
cpu_enabled: document.getElementById('s-therm-cpu-enabled').checked,
gpu_high: parseFloat(document.getElementById('s-therm-gpu-high').value) || 90,
gpu_resume: parseFloat(document.getElementById('s-therm-gpu-resume').value) || 87,
cpu_high: parseFloat(document.getElementById('s-therm-cpu-high').value) || 90,
cpu_resume: parseFloat(document.getElementById('s-therm-cpu-resume').value) || 87,
poll_seconds: parseFloat(document.getElementById('s-therm-poll').value) || 5,
},
broker:{
enabled: document.getElementById('s-broker-enabled').checked,
base_url: document.getElementById('s-broker-base-url').value.trim(),
......@@ -310,7 +387,7 @@ async function saveSettings(){
method:'POST', headers:{'Content-Type':'application/json'},
body: JSON.stringify(data)
});
if(r.ok) showAlert('info','Settings saved. Archive changes take effect immediately; restart CoderAI for other changes.');
if(r.ok) showAlert('info','Settings saved. Archive and thermal-protection changes take effect immediately; restart CoderAI for other changes.');
else{ const e=await r.json(); showAlert('error', e.detail||'Save failed'); }
}catch(e){ showAlert('error','Error: '+e.message); }
}
......
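The thermal settings above describe a pause/resume pair with a gap between the two thresholds (hysteresis) and a poll interval. A minimal sketch of that gate follows, assuming a hypothetical `read_temp_c` callable that returns the current temperature. The function name and its wiring are illustrative and are not taken from the CoderAI source.

```python
import asyncio
from typing import Callable


async def wait_until_cool(
    read_temp_c: Callable[[], float],  # hypothetical sensor reader (assumption)
    high_c: float = 90.0,              # mirrors the "pause when ... reaches" field
    resume_c: float = 87.0,            # mirrors the "resume when ... drops to" field
    poll_seconds: float = 5.0,         # mirrors the re-check interval field
) -> int:
    """Wait for a safe temperature; return how many cooldown polls were needed."""
    if read_temp_c() < high_c:
        return 0  # below the pause threshold: serve immediately
    polls = 0
    # Hysteresis: once paused, stay paused until the lower resume threshold,
    # so the gate does not flap around a single temperature value.
    while True:
        await asyncio.sleep(poll_seconds)  # yields, so other requests keep running
        polls += 1
        if read_temp_c() <= resume_c:
            return polls
```

Because the wait uses `asyncio.sleep`, it suspends only the request that hit the threshold. That is one plausible way to get the non-blocking behaviour the settings hint describes.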
......@@ -139,6 +139,7 @@ from codai.api.voice_clone import router as voice_clone_router
from codai.api.voice_convert import router as voice_convert_router
from codai.api.faceswap import router as faceswap_router
from codai.api.characters import router as characters_router
from codai.api.loras import router as loras_router
from codai.api.spatial import router as spatial_router
from codai.api.environments import router as environments_router
from codai.admin.routes import router as admin_router
......@@ -203,6 +204,7 @@ app.include_router(voice_clone_router)
app.include_router(voice_convert_router)
app.include_router(faceswap_router)
app.include_router(characters_router)
app.include_router(loras_router)
app.include_router(environments_router)
app.include_router(spatial_router)
app.include_router(admin_router)
......
......@@ -119,11 +119,35 @@ def _load_musicgen(model_name: str, device: str):
return model
def _load_audioldm(model_name: str, device: str):
def _load_audioldm(model_name: str, device: str, model_config: dict = None):
import torch
from diffusers import AudioLDM2Pipeline
pipe = AudioLDM2Pipeline.from_pretrained(model_name, torch_dtype=torch.float16)
pipe = pipe.to(device)
from codai.models.hf_loading import resolve_dtype
dtype = resolve_dtype(model_config, default='f16')
_xtra = {}
# Apply 4-bit/8-bit quantization to the diffusion backbone when configured.
_mc = model_config or {}
if _mc.get('load_in_4bit') or _mc.get('load_in_8bit'):
_bits = 4 if _mc.get('load_in_4bit') else 8
try:
from diffusers.quantizers import PipelineQuantizationConfig
_qk = ({'load_in_4bit': True, 'bnb_4bit_compute_dtype': dtype}
if _mc.get('load_in_4bit') else {'load_in_8bit': True})
_xtra['quantization_config'] = PipelineQuantizationConfig(
quant_backend=f"bitsandbytes_{_bits}bit",
quant_kwargs=_qk,
components_to_quantize=["transformer", "unet"],
)
print(f"AudioLDM quantization: {_bits}-bit (bitsandbytes)")
except Exception as e:
print(f"AudioLDM quantization unavailable: {e}")
pipe = AudioLDM2Pipeline.from_pretrained(model_name, torch_dtype=dtype, **_xtra)
# CPU offload when configured; otherwise place on device (skip for quantized).
_off = _mc.get('offload_strategy')
if _off in ('cpu', 'sequential', 'model', 'disk') and hasattr(pipe, 'enable_model_cpu_offload'):
pipe.enable_model_cpu_offload()
elif 'quantization_config' not in _xtra:
pipe = pipe.to(device)
return pipe
......@@ -224,7 +248,8 @@ async def audio_generate(request: AudioGenerationRequest, http_request: Request
Compatible models: MusicGen, AudioGen, AudioLDM2, StableAudio.
"""
_aud_progress_loading(request.model or "audio")
model_info = multi_model_manager.request_model(request.model, model_type="audio_gen")
model_info = await asyncio.to_thread(
multi_model_manager.request_model, request.model, model_type="audio_gen")
model_name = model_info.get('model_name')
if not model_name:
err = model_info.get('error', f"Model '{request.model}' not found")
......@@ -236,13 +261,14 @@ async def audio_generate(request: AudioGenerationRequest, http_request: Request
if pipe is None:
device = _derive_device()
model_type = _detect_audio_gen_type(model_name)
_ag_cfg = model_info.get('config') or {}
try:
if model_type in ('musicgen', 'audiogen'):
pipe = await asyncio.get_event_loop().run_in_executor(
None, _load_musicgen, model_name, device)
else:
pipe = await asyncio.get_event_loop().run_in_executor(
None, _load_audioldm, model_name, device)
None, _load_audioldm, model_name, device, _ag_cfg)
except Exception as e:
raise HTTPException(status_code=500, detail=f"Failed to load audio gen model: {e}")
multi_model_manager.models[model_key] = pipe
......
......@@ -37,9 +37,38 @@ import tempfile
import time
from typing import List, Optional
from fastapi import APIRouter, HTTPException, Request
from fastapi import APIRouter, Depends, HTTPException, Request
from pydantic import BaseModel, ConfigDict
def _require_api_auth(request: Request) -> None:
"""Raise 401 if auth is enabled and the request carries no valid credential."""
try:
from codai.admin import routes as _admin_routes
sm = _admin_routes.session_manager
except Exception:
return # auth subsystem unavailable — allow through
if sm is None:
return # auth not configured on this instance
auth = request.headers.get("authorization", "")
if auth.lower().startswith("bearer "):
token = auth[7:].strip()
if sm.verify_token(token):
return
cookie = request.cookies.get("session", "")
if cookie.endswith(".MUST_CHANGE"):
cookie = cookie[:-12]
if cookie and sm.validate_session(cookie):
return
raise HTTPException(
status_code=401,
detail={"message": "Invalid API key. Provide a valid Bearer token.",
"type": "invalid_request_error", "code": "invalid_api_key"},
)
from codai.platform_paths import default_characters_dir, legacy_style_config_dir
router = APIRouter()
......@@ -211,7 +240,12 @@ def _decode_source(data: str) -> bytes:
def _detect_faces_cv2(img_bytes: bytes):
"""Return list of (x,y,w,h) face rects using Haar cascade, or [] if cv2 unavailable."""
"""
Return list of (x,y,w,h) face rects, largest first.
Tries MediaPipe first (most accurate), then falls back to an OpenCV Haar cascade.
Detections smaller than 2% of image area are discarded as false positives.
Returns [] if no library is available or no plausible face is found.
"""
try:
import cv2
import numpy as np
......@@ -219,19 +253,56 @@ def _detect_faces_cv2(img_bytes: bytes):
img = cv2.imdecode(arr, cv2.IMREAD_COLOR)
if img is None:
return []
ih, iw = img.shape[:2]
img_area = ih * iw
min_face_area = img_area * 0.02 # reject anything < 2% of image
# ── Try MediaPipe first (most accurate, no model download needed) ──
try:
import mediapipe as mp
mp_face = mp.solutions.face_detection
with mp_face.FaceDetection(model_selection=1, min_detection_confidence=0.5) as det:
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
results = det.process(rgb)
if results.detections:
rects = []
for d in results.detections:
bb = d.location_data.relative_bounding_box
x = int(bb.xmin * iw)
y = int(bb.ymin * ih)
w = int(bb.width * iw)
h = int(bb.height * ih)
if w * h >= min_face_area:
rects.append((x, y, w, h))
if rects:
rects.sort(key=lambda r: r[2]*r[3], reverse=True)
return rects
except ImportError:
pass
# ── Haar cascade fallback (stricter parameters to reduce false positives) ──
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.equalizeHist(gray)
cascade_path = cv2.data.haarcascades + 'haarcascade_frontalface_default.xml'
cascade = cv2.CascadeClassifier(cascade_path)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(40, 40))
# minSize scaled to image: at least 8% of the shorter dimension
min_dim = int(min(iw, ih) * 0.08)
faces = cascade.detectMultiScale(
gray, scaleFactor=1.05, minNeighbors=8,
minSize=(max(40, min_dim), max(40, min_dim)),
)
if len(faces) == 0:
return []
return [(int(x), int(y), int(w), int(h)) for x, y, w, h in faces]
rects = [(int(x), int(y), int(w), int(h)) for x, y, w, h in faces
if int(w) * int(h) >= min_face_area]
rects.sort(key=lambda r: r[2]*r[3], reverse=True)
return rects
except Exception:
return []
def _crop_face(img_bytes: bytes, rect) -> Optional[bytes]:
"""Crop a face rect (with padding) from an image, return PNG bytes."""
"""Crop a face rect with generous padding (head-and-shoulders), return PNG bytes."""
try:
import cv2
import numpy as np
......@@ -241,11 +312,15 @@ def _crop_face(img_bytes: bytes, rect) -> Optional[bytes]:
if img is None:
return None
ih, iw = img.shape[:2]
pad = int(max(w, h) * 0.4)
x1 = max(0, x - pad)
y1 = max(0, y - pad)
x2 = min(iw, x + w + pad)
y2 = min(ih, y + h + pad)
side = max(w, h)
# More padding on top to include hair/forehead, less at bottom
pad_sides = int(side * 0.5)
pad_top = int(side * 0.7)
pad_bot = int(side * 0.4)
x1 = max(0, x - pad_sides)
y1 = max(0, y - pad_top)
x2 = min(iw, x + w + pad_sides)
y2 = min(ih, y + h + pad_bot)
crop = img[y1:y2, x1:x2]
ok, buf = cv2.imencode('.png', crop)
return bytes(buf) if ok else None
......@@ -274,7 +349,7 @@ def _extract_from_image(img_bytes: bytes) -> List[bytes]:
crops = [c for f in faces for c in [_crop_face(img_bytes, f)] if c]
if crops:
return crops
# No face detected — use whole image as reference
# No face detected (or all detections filtered as false positives) — use whole image
try:
from PIL import Image as PILImage
img = PILImage.open(io.BytesIO(img_bytes)).convert('RGB')
......@@ -345,7 +420,7 @@ def resolve_character_profiles(profile_names: List[str]) -> List[str]:
# ── Endpoints ─────────────────────────────────────────────────────────────────
@router.post("/v1/characters")
async def save_character(req: CharacterSaveRequest):
async def save_character(req: CharacterSaveRequest, _auth=Depends(_require_api_auth)):
"""Save or update a named character profile."""
if not req.name or '/' in req.name or '..' in req.name:
raise HTTPException(status_code=400, detail="Invalid character name")
......@@ -356,13 +431,13 @@ async def save_character(req: CharacterSaveRequest):
@router.get("/v1/characters")
async def list_characters():
async def list_characters(_auth=Depends(_require_api_auth)):
"""List all saved character profiles (metadata only, no images)."""
return {"characters": _list_characters()}
@router.get("/v1/characters/{name}")
async def get_character(name: str):
async def get_character(name: str, _auth=Depends(_require_api_auth)):
"""Get a character profile including its reference images as base64."""
meta = _load_character_meta(name)
if not meta:
......@@ -378,7 +453,7 @@ async def get_character(name: str):
@router.delete("/v1/characters/{name}")
async def delete_character(name: str):
async def delete_character(name: str, _auth=Depends(_require_api_auth)):
"""Delete a character profile."""
cdir = _char_dir(name)
if not os.path.isdir(cdir):
......@@ -389,7 +464,7 @@ async def delete_character(name: str):
@router.patch("/v1/characters/{name}")
async def patch_character(name: str, req: CharacterPatchRequest):
async def patch_character(name: str, req: CharacterPatchRequest, _auth=Depends(_require_api_auth)):
"""Update a character profile: description, add images, or remove images by index."""
meta = _load_character_meta(name)
if not meta:
......@@ -462,29 +537,24 @@ async def generate_character(req: CharacterGenerateRequest, request: Request):
if req.steps:
payload["steps"] = req.steps
# Forward the caller's auth token so rate-limit / auth middleware passes
auth_header = request.headers.get("authorization", "")
headers = {"Content-Type": "application/json"}
if auth_header:
headers["Authorization"] = auth_header
try:
from httpx import AsyncClient, ASGITransport
async with AsyncClient(
transport=ASGITransport(app=request.app),
base_url="http://internal",
timeout=300,
) as client:
r = await client.post("/v1/images/generations", json=payload, headers=headers)
if not r.is_success:
import json as _json
from codai.broker.asgi_bridge import execute_internal_request
resp = await execute_internal_request(
request.app,
method="POST",
path="/v1/images/generations",
headers={"Content-Type": "application/json"},
body=_json.dumps(payload).encode(),
)
if resp["status_code"] >= 400:
try:
detail = r.json().get("detail", r.text)
detail = _json.loads(resp["body"]).get("detail", resp["body"].decode())
except Exception:
detail = r.text
raise HTTPException(status_code=r.status_code, detail=f"Image generation failed: {detail}")
detail = resp["body"].decode()
raise HTTPException(status_code=resp["status_code"], detail=f"Image generation failed: {detail}")
images_data = r.json().get("data", [])
images_data = _json.loads(resp["body"]).get("data", [])
except HTTPException:
raise
except Exception as e:
......
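Both detector paths in `_detect_faces_cv2` apply the same post-processing: drop rectangles smaller than 2% of the image area, then sort largest first. A small standalone sketch of that step is shown here. The helper name `filter_face_rects` is hypothetical and does not exist in the diff.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, w, h)


def filter_face_rects(
    rects: List[Rect],
    img_w: int,
    img_h: int,
    min_fraction: float = 0.02,  # same 2% threshold the diff uses
) -> List[Rect]:
    """Keep rects covering at least min_fraction of the image, largest first."""
    min_area = img_w * img_h * min_fraction
    kept = [r for r in rects if r[2] * r[3] >= min_area]
    kept.sort(key=lambda r: r[2] * r[3], reverse=True)
    return kept
```

Factoring the filter out like this would remove the duplication between the MediaPipe and Haar branches, though whether that is worth doing is a judgment call for the maintainers.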
......@@ -48,10 +48,16 @@ def _derive_device() -> str:
return "cuda:0"
def _load_embedding_model(model_name: str, device: str):
def _load_embedding_model(model_name: str, device: str, model_config: dict = None):
from codai.models.hf_loading import build_from_pretrained_kwargs
try:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer(model_name, device=device)
# sentence-transformers honours quantization via model_kwargs.
fp = build_from_pretrained_kwargs(model_config)
st_kwargs = {}
if 'quantization_config' in fp:
st_kwargs['model_kwargs'] = {'quantization_config': fp['quantization_config']}
model = SentenceTransformer(model_name, device=device, **st_kwargs)
return ('sentence_transformers', model)
except ImportError:
pass
......@@ -59,8 +65,11 @@ def _load_embedding_model(model_name: str, device: str):
try:
from transformers import AutoTokenizer, AutoModel
import torch
fp = build_from_pretrained_kwargs(model_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).to(device)
model = AutoModel.from_pretrained(model_name, **fp)
if 'quantization_config' not in fp and 'device_map' not in fp:
model = model.to(device)
return ('transformers', (tokenizer, model, device))
except Exception as e:
raise RuntimeError(f"Cannot load embedding model '{model_name}': {e}")
......@@ -97,7 +106,8 @@ async def create_embeddings(request: EmbeddingsRequest, http_request: Request =
"""
OpenAI-compatible embeddings endpoint.
"""
model_info = multi_model_manager.request_model(request.model, model_type="embedding")
model_info = await asyncio.to_thread(
multi_model_manager.request_model, request.model, model_type="embedding")
model_name = model_info.get('model_name')
if not model_name:
err = model_info.get('error', f"Model '{request.model}' not found")
......@@ -108,9 +118,11 @@ async def create_embeddings(request: EmbeddingsRequest, http_request: Request =
if model_obj is None:
device = _derive_device()
_emb_cfg = (multi_model_manager.config.get(f"embedding:{model_name}")
or multi_model_manager.config.get(model_name) or {})
try:
model_obj = await asyncio.get_event_loop().run_in_executor(
None, _load_embedding_model, model_name, device)
None, _load_embedding_model, model_name, device, _emb_cfg)
except Exception as e:
raise HTTPException(status_code=500, detail=f"Failed to load embedding model: {e}")
multi_model_manager.models[model_key] = model_obj
......
......@@ -39,7 +39,32 @@ import tempfile
import time
from typing import List, Optional
from fastapi import APIRouter, HTTPException, Request
from fastapi import APIRouter, Depends, HTTPException, Request
def _require_api_auth(request: Request) -> None:
"""Raise 401 if auth is enabled and the request carries no valid credential."""
try:
from codai.admin import routes as _admin_routes
sm = _admin_routes.session_manager
except Exception:
return
if sm is None:
return
auth = request.headers.get("authorization", "")
if auth.lower().startswith("bearer "):
if sm.verify_token(auth[7:].strip()):
return
cookie = request.cookies.get("session", "")
if cookie.endswith(".MUST_CHANGE"):
cookie = cookie[:-12]
if cookie and sm.validate_session(cookie):
return
raise HTTPException(
status_code=401,
detail={"message": "Invalid API key. Provide a valid Bearer token.",
"type": "invalid_request_error", "code": "invalid_api_key"},
)
from pydantic import BaseModel, ConfigDict
from codai.platform_paths import default_environments_dir, legacy_style_config_dir
......@@ -283,7 +308,7 @@ def resolve_environment_profiles(profile_names: List[str]) -> List[str]:
# ── Endpoints ─────────────────────────────────────────────────────────────────
@router.post("/v1/environments")
async def save_environment(req: EnvironmentSaveRequest):
async def save_environment(req: EnvironmentSaveRequest, _auth=Depends(_require_api_auth)):
"""Save or update a named environment profile."""
if not req.name or '/' in req.name or '..' in req.name:
raise HTTPException(status_code=400, detail="Invalid environment name")
......@@ -294,13 +319,13 @@ async def save_environment(req: EnvironmentSaveRequest):
@router.get("/v1/environments")
async def list_environments():
async def list_environments(_auth=Depends(_require_api_auth)):
"""List all saved environment profiles (metadata only)."""
return {"environments": _list_environments()}
@router.get("/v1/environments/{name}")
async def get_environment(name: str):
async def get_environment(name: str, _auth=Depends(_require_api_auth)):
"""Get an environment profile including its reference images as base64."""
meta = _load_environment_meta(name)
if not meta:
......@@ -316,7 +341,7 @@ async def get_environment(name: str):
@router.delete("/v1/environments/{name}")
async def delete_environment(name: str):
async def delete_environment(name: str, _auth=Depends(_require_api_auth)):
"""Delete an environment profile."""
edir = _env_dir(name)
if not os.path.isdir(edir):
......@@ -327,7 +352,7 @@ async def delete_environment(name: str):
@router.patch("/v1/environments/{name}")
async def patch_environment(name: str, req: EnvironmentPatchRequest):
async def patch_environment(name: str, req: EnvironmentPatchRequest, _auth=Depends(_require_api_auth)):
"""Update an environment profile: description, add images, or remove images by index."""
meta = _load_environment_meta(name)
if not meta:
......@@ -398,28 +423,24 @@ async def generate_environment(req: EnvironmentGenerateRequest, request: Request
if req.steps:
payload["steps"] = req.steps
auth_header = request.headers.get("authorization", "")
headers = {"Content-Type": "application/json"}
if auth_header:
headers["Authorization"] = auth_header
try:
from httpx import AsyncClient, ASGITransport
async with AsyncClient(
transport=ASGITransport(app=request.app),
base_url="http://internal",
timeout=300,
) as client:
r = await client.post("/v1/images/generations", json=payload, headers=headers)
if not r.is_success:
import json as _json
from codai.broker.asgi_bridge import execute_internal_request
resp = await execute_internal_request(
request.app,
method="POST",
path="/v1/images/generations",
headers={"Content-Type": "application/json"},
body=_json.dumps(payload).encode(),
)
if resp["status_code"] >= 400:
try:
detail = r.json().get("detail", r.text)
detail = _json.loads(resp["body"]).get("detail", resp["body"].decode())
except Exception:
detail = r.text
raise HTTPException(status_code=r.status_code, detail=f"Image generation failed: {detail}")
detail = resp["body"].decode()
raise HTTPException(status_code=resp["status_code"], detail=f"Image generation failed: {detail}")
images_data = r.json().get("data", [])
images_data = _json.loads(resp["body"]).get("data", [])
except HTTPException:
raise
except Exception as e:
......
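The `_require_api_auth` dependency added to the characters and environments routers performs two small string operations before calling the session manager: extracting a Bearer token and stripping a `.MUST_CHANGE` cookie suffix. The sketch below isolates just those two steps. The helper names are hypothetical, and the session manager calls are left out because their behaviour is not visible in this diff.

```python
from typing import Optional

_MUST_CHANGE_SUFFIX = ".MUST_CHANGE"  # 12 characters, matching cookie[:-12] in the diff


def extract_bearer_token(authorization: str) -> Optional[str]:
    """Return the token from an 'Authorization: Bearer <token>' value, else None."""
    if authorization.lower().startswith("bearer "):
        token = authorization[7:].strip()
        return token or None
    return None


def strip_must_change(cookie: str) -> str:
    """Drop the trailing '.MUST_CHANGE' marker from a session cookie, if present."""
    if cookie.endswith(_MUST_CHANGE_SUFFIX):
        return cookie[: -len(_MUST_CHANGE_SUFFIX)]
    return cookie
```

Since the same function body is duplicated in both API modules, a shared helper module along these lines might be one way to keep the two copies from drifting apart.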
......@@ -283,25 +283,69 @@ async def chat_completions(request: ChatCompletionRequest, http_request: Request
# Continue with original implementation for 'auto' parser
# Get the model for this request
requested_model = request.model
# Use the manager to resolve the model and manage VRAM (handles ondemand unloading)
model_info = multi_model_manager.request_model(
requested_model=requested_model,
model_type="text"
)
# Check if the model was rejected as not allowed
if model_info.get('error'):
raise HTTPException(status_code=404, detail=model_info['error'])
# Acquire the least-busy instance (increments ref-count; released on response completion)
_model_key = model_info.get('model_key')
# Resolve and load the model, waiting if another model is currently loading.
# Retries up to 60 times, 5 s apart (about 5 minutes, not counting time spent
# waiting on the ready event), so requests queue behind long video loads
# rather than failing immediately with "No model loaded".
_MAX_WAIT_TRIES = 60
_model_key = None
_instance_idx = None
_acq = multi_model_manager.acquire_model_instance(_model_key) if _model_key else None
if _acq:
_instance_idx, mm = _acq
else:
mm = multi_model_manager.get_model_for_request(requested_model)
mm = None
model_info = {}
for _attempt in range(_MAX_WAIT_TRIES):
# Fail fast on a corrupted CUDA context — retrying 60× is pointless.
if getattr(multi_model_manager, 'cuda_context_poisoned', False):
raise HTTPException(status_code=503, detail=(
"CUDA context corrupted by an earlier device-side assert "
f"({multi_model_manager.cuda_poison_reason}). Restart coderai to recover."))
# If another model is loading, yield the event loop and wait for it to finish.
if not multi_model_manager._model_ready_event.is_set():
print(f"Text model '{requested_model}': waiting for model load to complete "
f"(attempt {_attempt + 1}/{_MAX_WAIT_TRIES})…")
await asyncio.to_thread(
multi_model_manager._model_ready_event.wait, 30.0
)
await asyncio.sleep(0)
# In a thread: request_model may block waiting for a busy model to go
# idle before evicting it; blocking the event loop here would deadlock.
model_info = await asyncio.to_thread(
multi_model_manager.request_model,
requested_model,
"text",
)
if model_info.get('error'):
# CUDA-poison errors are unrecoverable → 503; others (unknown model) → 404.
_status = 503 if 'CUDA context corrupted' in str(model_info['error']) else 404
raise HTTPException(status_code=_status, detail=model_info['error'])
_model_key = model_info.get('model_key')
_candidate = None
_acq = multi_model_manager.acquire_model_instance(_model_key) if _model_key else None
if _acq:
_instance_idx, _candidate = _acq
# Guard against stale pool entries (model evicted but pool not cleared)
if hasattr(_candidate, 'backend') and _candidate.backend is None:
multi_model_manager.release_model_instance(_model_key, _instance_idx)
_instance_idx = None
_candidate = None
if _candidate is None:
_candidate = multi_model_manager.get_model_for_request(requested_model)
if _candidate is None and model_manager.backend is not None:
_candidate = model_manager
# Validate the candidate has a working backend before accepting it
if _candidate is not None:
if hasattr(_candidate, 'backend') and _candidate.backend is None:
_candidate = None
if _candidate is not None:
mm = _candidate
break
print(f"Text model '{requested_model}' not ready, retrying in 5s "
f"(attempt {_attempt + 1}/{_MAX_WAIT_TRIES})…")
await asyncio.sleep(5)
def _release_instance():
if _instance_idx is not None and _model_key:
......@@ -309,12 +353,10 @@ async def chat_completions(request: ChatCompletionRequest, http_request: Request
if mm is None:
_release_instance()
if model_manager.backend is not None:
current_manager = model_manager
else:
raise HTTPException(status_code=503, detail="Model not loaded")
else:
current_manager = mm
raise HTTPException(status_code=503,
detail=f"Model '{requested_model}' could not be loaded after waiting. "
"Another model may be using all available VRAM.")
current_manager = mm
# Inject system prompt if --system-prompt flag was provided
messages = request.messages
......@@ -1161,6 +1203,7 @@ async def chat_completions(request: ChatCompletionRequest, http_request: Request
tool_parser,
request.response_format,
_prefix_key,
enable_thinking=reasoning_enabled,
):
yield chunk
finally:
......@@ -1182,6 +1225,7 @@ async def chat_completions(request: ChatCompletionRequest, http_request: Request
tool_parser,
request.response_format,
force_reasoning_args,
enable_thinking=reasoning_enabled,
)
finally:
_release_instance()
......@@ -1198,6 +1242,7 @@ async def stream_chat_response(
tool_parser: ToolCallParser,
response_format: Optional[Dict] = None,
prefix_key: str = "",
enable_thinking: bool = False,
) -> AsyncGenerator[str, None]:
"""Stream chat completion response with queue notifications."""
completion_id = f"chatcmpl-{uuid.uuid4().hex}"
......@@ -1327,6 +1372,7 @@ async def stream_chat_response(
stop=stop,
tools=tools,
response_format=response_format,
enable_thinking=enable_thinking,
):
chunk_count += 1
# Always filter malformed content (regex-based, works per-chunk)
......@@ -1547,6 +1593,7 @@ async def generate_chat_response(
tool_parser: ToolCallParser,
response_format: Optional[Dict] = None,
force_reasoning_args: Optional[List[str]] = None,
enable_thinking: bool = False,
) -> Dict:
"""Generate non-streaming chat completion response."""
completion_id = f"chatcmpl-{uuid.uuid4().hex}"
......@@ -1583,6 +1630,7 @@ async def generate_chat_response(
stop=stop,
tools=tools,
response_format=response_format,
enable_thinking=enable_thinking,
)
# Always filter out malformed content
......@@ -1748,9 +1796,12 @@ async def completions(request: CompletionRequest):
requested_model = request.model
# Use the manager to resolve the model and manage VRAM (handles on-demand unloading)
model_info = multi_model_manager.request_model(
# In a thread: request_model may block (thermal cooldown / waiting for a busy
# model) and we must not stall the event loop.
model_info = await asyncio.to_thread(
multi_model_manager.request_model,
requested_model=requested_model,
model_type="text"
model_type="text",
)
# Check if the model was rejected as not allowed
......
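Review note: the `asyncio.to_thread` wrapping applied to `request_model` can be illustrated in isolation. This is a minimal sketch, not the project's code: `blocking_request_model` is a hypothetical stand-in for `multi_model_manager.request_model`, and the sleep only simulates a cooldown or busy-model wait. It suggests why the event loop stays responsive while the blocking call runs in a worker thread.

```python
import asyncio
import time


def blocking_request_model(requested_model: str, model_type: str) -> dict:
    """Hypothetical stand-in for a manager call that may block."""
    time.sleep(0.2)  # simulates waiting for a cooldown or a busy model
    return {"model": requested_model, "type": model_type}


async def heartbeat(ticks: list) -> None:
    """Runs on the event loop; it only progresses if the loop is not stalled."""
    for _ in range(5):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.02)


async def handle_request() -> tuple:
    ticks: list = []
    beat = asyncio.create_task(heartbeat(ticks))
    # asyncio.to_thread forwards keyword arguments to the wrapped function.
    info = await asyncio.to_thread(
        blocking_request_model, requested_model="demo", model_type="text"
    )
    await beat
    return info, len(ticks)
```

Calling `asyncio.run(handle_request())` returns the model info together with the number of heartbeat ticks that ran while the worker thread was blocked. `asyncio.to_thread` requires Python 3.9 or later.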
......@@ -18,6 +18,7 @@
Audio transcription endpoint for the codai API.
"""
import asyncio
import io
import os
import tempfile
......@@ -143,7 +144,8 @@ async def create_transcription(
else multi_model_manager.whisper_servers.get(model)
)
if whisper_server is not None:
multi_model_manager.request_model(requested_model=model, model_type="audio")
await asyncio.to_thread(
multi_model_manager.request_model, requested_model=model, model_type="audio")
if not whisper_server.is_running():
whisper_server.start(
getattr(whisper_server, "_model_path", None),
......@@ -166,7 +168,8 @@ async def create_transcription(
return _format_response(response_format, result.get("text", ""), [])
# Use the manager to resolve the model and manage VRAM
model_info = multi_model_manager.request_model(
model_info = await asyncio.to_thread(
multi_model_manager.request_model,
requested_model=model,
model_type="audio"
)
......
......@@ -99,9 +99,10 @@ async def create_speech(request: TTSRequest, http_request: Request = None):
return {"audio": audio_base64}
# Use the manager to resolve the model and manage VRAM
model_info = multi_model_manager.request_model(
model_info = await asyncio.to_thread(
multi_model_manager.request_model,
requested_model=request.model,
model_type="tts"
model_type="tts",
)
# Check if the model was rejected as not allowed
......
......@@ -44,6 +44,31 @@ except (ImportError, AttributeError):
_grammar_guided_gen = False
def _make_thermal_criteria():
"""A StoppingCriteria that pauses generation while the CPU/GPU is too hot.
It runs ON the generation thread (between token forward passes), so blocking
here actually pauses GPU work — unlike the streamer consumer loop, which is
decoupled. The criterion always returns False, so it never ends generation, and
it is throttled so it doesn't read sensors on every token. This factory returns
None if transformers is unavailable.
"""
try:
from transformers import StoppingCriteria
except Exception:
return None
class _ThermalPause(StoppingCriteria):
def __call__(self, input_ids, scores, **kwargs):
try:
from codai.models.thermal import checkpoint
checkpoint(context="text-gen", throttle_seconds=2.0)
except Exception:
pass
return False
return _ThermalPause()
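Review note: the throttling mentioned in the docstring can be sketched without transformers. This is only an assumption about how `checkpoint(..., throttle_seconds=2.0)` might behave, since `codai.models.thermal` is not part of this diff; `ThrottledCheck` and `probe` are hypothetical names.

```python
import time


class ThrottledCheck:
    """Calls `probe` at most once per `throttle_seconds` (hypothetical helper)."""

    def __init__(self, probe, throttle_seconds: float = 2.0, clock=time.monotonic):
        self.probe = probe
        self.throttle_seconds = throttle_seconds
        self.clock = clock
        self._last = None  # time of the last real probe, None before the first

    def __call__(self) -> bool:
        now = self.clock()
        if self._last is None or now - self._last >= self.throttle_seconds:
            self._last = now
            self.probe()  # in a real setup this may block while the device cools
        return False  # never signals "stop generating"
```

The injectable `clock` is a design choice for testability: a fake clock lets the throttle interval be checked without sleeping.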
class NvidiaBackend(ModelBackend):
"""Backend for NVIDIA GPUs using HuggingFace Transformers."""
......@@ -201,6 +226,36 @@ class NvidiaBackend(ModelBackend):
raise e
raise
def _make_bnb_config(self, model_name: str, load_in_4bit: bool, load_in_8bit: bool):
"""Build a transformers BitsAndBytesConfig (the modern quant API).
Passing load_in_4bit/load_in_8bit as direct from_pretrained kwargs is
removed in recent transformers and raises TypeError — which previously
forced a silent fallback to FULL-PRECISION loading (the model then no
longer fit on the GPU, offloaded to CPU, and leaked VRAM on eviction).
Always go through quantization_config instead.
"""
ml = model_name.lower()
if 'qwen3.5' in ml and ('a3b' in ml or 'moe' in ml):
print(f"Warning: {model_name} does not support bitsandbytes quantization")
return None
try:
import bitsandbytes as bnb # noqa: F401
import torch
from transformers import BitsAndBytesConfig
except ImportError:
print("Warning: bitsandbytes not installed. Quantization disabled.")
return None
print(f"Using {4 if load_in_4bit else 8}-bit quantization")
if load_in_4bit:
return BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)
return BitsAndBytesConfig(load_in_8bit=True)
def _is_moe_model(self, model_name: str) -> bool:
"""Check if model is a MoE model."""
moe_indicators = ['moe', 'mixtral', 'qwen3_5_moe', 'qwen3.5_moe', 'expert', 'a3b']
......@@ -317,7 +372,8 @@ class NvidiaBackend(ModelBackend):
flash_attn = kwargs.get('flash_attn', False)
offload_strategy = kwargs.get('offload_strategy', 'auto')
max_gpu_percent = kwargs.get('max_gpu_percent', None)
expected_vram_gb = kwargs.get('expected_vram_gb') or 0
# Check for --no-ram mode
no_ram = kwargs.get('no_ram', False)
if not no_ram:
......@@ -328,12 +384,37 @@ class NvidiaBackend(ModelBackend):
no_ram = True
except Exception:
pass
self._pending_ram_gb = manual_ram_gb
print(f"Loading HuggingFace model: {model_name}")
self.use_flash_attn = flash_attn
# Flash-Attention-2 requires the ENTIRE model resident on a single CUDA
# device. If the model will be split across GPU+CPU (offloading), FA2
# triggers a device-side assert that corrupts the whole CUDA context.
# So FA2 is only safe when the model fits fully in free GPU VRAM, or the
# user forced full-GPU residence (no_ram / offload_strategy='none').
self._fa2_safe = True
if flash_attn:
_full_gpu_forced = no_ram or offload_strategy == 'none'
if not _full_gpu_forced:
try:
import torch as _t
if _t.cuda.is_available() and expected_vram_gb > 0:
_free, _ = _t.cuda.mem_get_info(0)
_free_gb = _free / 1e9
# expected_vram_gb already includes ~15% overhead; the
# model must fit entirely on GPU for FA2 to be safe.
if expected_vram_gb > _free_gb:
self._fa2_safe = False
print(f" Flash Attention 2 disabled: model needs "
f"~{expected_vram_gb:.1f} GB but only {_free_gb:.1f} GB "
f"GPU free → will offload to CPU (FA2 needs full-GPU "
f"residence). Using SDPA instead.")
except Exception:
pass
self.use_flash_attn = flash_attn and self._fa2_safe
self.check_flash_attn_support()
self.device = self._detect_device()
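Review note: the Flash-Attention-2 safety decision in this hunk reads more easily as a pure function. The sketch below is a hypothetical restatement (`fa2_is_safe` does not exist in the codebase) and assumes the free-VRAM figure has already been read, for example from `torch.cuda.mem_get_info`.

```python
def fa2_is_safe(flash_attn: bool, no_ram: bool, offload_strategy: str,
                expected_vram_gb: float, free_vram_gb: float) -> bool:
    """Return True when FA2 would stay enabled under the rules in this hunk."""
    if not flash_attn:
        return False  # FA2 was not requested at all
    if no_ram or offload_strategy == "none":
        return True  # the user forced full-GPU residence
    if expected_vram_gb <= 0:
        return True  # no estimate available; the hunk keeps FA2 in this case
    # The whole model must fit in free VRAM, otherwise it will be split
    # across GPU and CPU and FA2 is not safe.
    return expected_vram_gb <= free_vram_gb
```

For example, `fa2_is_safe(True, False, "auto", 20.0, 12.0)` is `False`, which corresponds to the SDPA fallback path.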
......@@ -368,16 +449,9 @@ class NvidiaBackend(ModelBackend):
# Still allow quantization in no-ram mode (reduces VRAM usage)
if load_in_4bit or load_in_8bit:
if 'qwen3.5' in model_name.lower() and ('a3b' in model_name.lower() or 'moe' in model_name.lower()):
print(f" Warning: {model_name} does not support bitsandbytes quantization")
else:
try:
import bitsandbytes as bnb
print(f" Using {4 if load_in_4bit else 8}-bit quantization")
load_kwargs['load_in_4bit'] = load_in_4bit
load_kwargs['load_in_8bit'] = load_in_8bit
except ImportError:
print(" Warning: bitsandbytes not installed. Quantization disabled.")
_qc = self._make_bnb_config(model_name, load_in_4bit, load_in_8bit)
if _qc is not None:
load_kwargs['quantization_config'] = _qc
try:
model = AutoModelForCausalLM.from_pretrained(model_name, **load_kwargs)
......@@ -404,17 +478,10 @@ class NvidiaBackend(ModelBackend):
load_kwargs = {'trust_remote_code': True}
if load_in_4bit or load_in_8bit:
if 'qwen3.5' in model_name.lower() and ('a3b' in model_name.lower() or 'moe' in model_name.lower()):
print(f"Warning: {model_name} does not support bitsandbytes quantization")
else:
try:
import bitsandbytes as bnb
print(f"Using {4 if load_in_4bit else 8}-bit quantization")
load_kwargs['load_in_4bit'] = load_in_4bit
load_kwargs['load_in_8bit'] = load_in_8bit
except ImportError:
print("Warning: bitsandbytes not installed. Quantization disabled.")
_qc = self._make_bnb_config(model_name, load_in_4bit, load_in_8bit)
if _qc is not None:
load_kwargs['quantization_config'] = _qc
if self.device == "cuda":
load_kwargs['dtype'] = torch.float16
else:
......@@ -427,7 +494,12 @@ class NvidiaBackend(ModelBackend):
if self.use_flash_attn and self.flash_attn_available:
load_kwargs['attn_implementation'] = "flash_attention_2"
print("Using Flash Attention 2")
else:
# SDPA safely handles GPU+CPU split models and still uses flash
# kernels for the GPU-resident layers — the safe default when the
# model is offloaded (FA2 would device-side-assert here).
load_kwargs['attn_implementation'] = "sdpa"
model = None
vram_percentages = self._get_vram_percentages_for_gpu(model_name, offload_strategy, max_gpu_percent)
......@@ -450,40 +522,86 @@ class NvidiaBackend(ModelBackend):
)
else:
first_vram_pct = vram_percentages[0] if vram_percentages else 0.93
for vram_pct in vram_percentages:
if self.device != "cuda":
load_kwargs['device_map'] = None
print("Loading model in CPU-only mode...")
model = self._try_load_model(model_name, load_kwargs, self.device)
if model is not None:
break
# No CUDA device — go straight to CPU+disk loading below.
break
if vram_pct > 0:
# Build max_memory: GPU budget capped at actual FREE VRAM so
# we never try to allocate more than what's physically available.
# Excess layers overflow to CPU RAM automatically via device_map.
max_memory = self._get_gpu_memory_map_with_limit(vram_pct)
load_kwargs['max_memory'] = max_memory
load_kwargs['device_map'] = 'auto'
print(f"\nTrying with GPU limit: {vram_pct*100:.0f}% VRAM")
_gpu_gb = max_memory.get(0, 0) / 1e9
_cpu_gb = max_memory.get('cpu', 0) / 1e9
print(f"\nTrying GPU {_gpu_gb:.1f} GB + CPU {_cpu_gb:.1f} GB"
f" (device_map=auto, {vram_pct*100:.0f}% VRAM cap)")
model = self._try_load_model(model_name, load_kwargs, self.device)
if model is not None:
print(f" ✓ Model loaded successfully with {vram_pct*100:.0f}% GPU VRAM limit")
print(f" ✓ Model loaded — GPU {_gpu_gb:.1f} GB / CPU {_cpu_gb:.1f} GB")
if vram_pct < first_vram_pct:
print(f" (Reduced from {first_vram_pct*100:.0f}% due to memory constraints)")
print(f" (Reduced GPU cap from {first_vram_pct*100:.0f}%"
f" due to memory constraints)")
break
else:
print(f" ✗ Out of memory with {vram_pct*100:.0f}% GPU VRAM, trying lower limit...")
print(f" ✗ OOM at GPU {_gpu_gb:.1f} GB, trying lower GPU cap…")
if torch.cuda.is_available():
torch.cuda.empty_cache()
else:
print("\nFalling back to CPU-only mode...")
load_kwargs['max_memory'] = {0: 0, 'cpu': int((manual_ram_gb or 48) * 1e9)}
load_kwargs['device_map'] = 'auto'
model = self._try_load_model(model_name, load_kwargs, "cpu")
# vram_pct == 0: GPU (all free VRAM) + CPU RAM + disk overflow.
# Use every byte of GPU that's free, then spill to CPU RAM, then
# disk — NEVER leave GPU idle when loading this fallback level.
import psutil as _psutil
_free_vram = 0
if torch.cuda.is_available():
try:
_free_vram, _ = torch.cuda.mem_get_info(0)
except Exception:
pass
_headroom = 512 * 1024 * 1024
_gpu_budget = max(0, _free_vram - _headroom)
_free_ram = _psutil.virtual_memory().available
_cpu_budget = max(int(2e9), int(_free_ram * 0.80))
_disk_dir = offload_dir or os.path.join(
os.path.expanduser('~'), '.cache', 'coderai', 'offload')
os.makedirs(_disk_dir, exist_ok=True)
print(f"\nGPU {_gpu_budget/1e9:.1f} GB + CPU {_cpu_budget/1e9:.1f} GB"
f" + disk ({_disk_dir})")
_spill_kwargs = {
**load_kwargs,
'device_map': 'auto',
'max_memory': {0: _gpu_budget, 'cpu': _cpu_budget},
'offload_folder': _disk_dir,
'offload_buffers': True,
}
model = self._try_load_model(model_name, _spill_kwargs, self.device)
if model is not None:
print(" ✓ Model loaded successfully on CPU")
print(f" ✓ Model loaded — GPU {_gpu_budget/1e9:.1f} GB"
f" / CPU {_cpu_budget/1e9:.1f} GB / disk overflow")
break
# Absolute last resort: pure CPU without device_map.
# Only reached when CUDA is unavailable or all GPU+RAM+disk paths failed.
# Uses device_map=None to avoid accelerate hooks that assume CUDA.
if model is None:
print("\nFalling back to pure CPU (no GPU available)…")
cpu_kwargs = {
'trust_remote_code': True,
'torch_dtype': torch.float32,
'low_cpu_mem_usage': True,
}
if offload_dir:
cpu_kwargs['offload_folder'] = offload_dir
# Flash Attention 2 needs CUDA and fp16/bf16 weights, so it must not be
# requested on the pure-CPU float32 path; SDPA works on CPU.
cpu_kwargs['attn_implementation'] = "sdpa"
model = self._try_load_model(model_name, cpu_kwargs, "cpu")
if model is not None:
print(" ✓ Model loaded on CPU (no GPU)")
if model is None:
raise RuntimeError("Failed to load model: Out of memory even with minimum GPU usage")
......@@ -499,17 +617,29 @@ class NvidiaBackend(ModelBackend):
print(f"Model capabilities: {caps}")
def _get_gpu_memory_map_with_limit(self, vram_fraction: float) -> Dict:
"""Get max_memory dict with specified VRAM fraction limit."""
"""Get max_memory dict for device_map='auto'.
GPU budget = min(total × fraction, free − 512 MB headroom).
Capping at free VRAM ensures we never ask accelerate to allocate more
than what's physically available; layers that exceed the GPU budget
spill to CPU RAM automatically via device_map.
"""
import torch
max_memory = {}
if torch.cuda.is_available():
for i in range(torch.cuda.device_count()):
props = torch.cuda.get_device_properties(i)
total_vram = props.total_memory
usable_vram = int(total_vram * vram_fraction)
max_memory[i] = usable_vram
try:
free_vram, _ = torch.cuda.mem_get_info(i)
except Exception:
free_vram = total_vram
headroom = 512 * 1024 * 1024 # 512 MB for CUDA driver overhead
limit_by_fraction = int(total_vram * vram_fraction)
limit_by_free = max(0, free_vram - headroom)
max_memory[i] = min(limit_by_fraction, limit_by_free)
manual_ram_gb = getattr(self, '_pending_ram_gb', None)
if manual_ram_gb:
max_memory['cpu'] = int(manual_ram_gb * 1e9)
......@@ -518,7 +648,7 @@ class NvidiaBackend(ModelBackend):
available_ram = psutil.virtual_memory().available
usable_ram = max(0, available_ram - int(4e9))
max_memory['cpu'] = usable_ram
return max_memory
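Review note: the per-GPU budget rule from the docstring, `min(total × fraction, free − 512 MB headroom)`, can be checked with plain integers. This is a minimal sketch; `gpu_budget` and `HEADROOM_BYTES` are hypothetical names and nothing here touches CUDA.

```python
HEADROOM_BYTES = 512 * 1024 * 1024  # same 512 MB reserve as in the diff


def gpu_budget(total_vram: int, free_vram: int, vram_fraction: float) -> int:
    """Return min(total * fraction, free - headroom), never negative."""
    limit_by_fraction = int(total_vram * vram_fraction)
    limit_by_free = max(0, free_vram - HEADROOM_BYTES)
    return min(limit_by_fraction, limit_by_free)


GIB = 1024 ** 3
# A 24 GiB card with only 10 GiB free: the free-memory cap wins.
busy_card = gpu_budget(24 * GIB, 10 * GIB, 0.93)
# The same card fully free with a 50% cap: the fraction cap wins.
idle_card = gpu_budget(24 * GIB, 24 * GIB, 0.5)
```

When less than the headroom is free, the budget is 0 and every layer is left for CPU RAM.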
def format_messages(self, messages: List[ChatMessage]) -> str:
......@@ -835,19 +965,24 @@ class NvidiaBackend(ModelBackend):
if repeat_penalty != 1.0:
generation_kwargs["repetition_penalty"] = repeat_penalty
# Mid-generation thermal checkpoint (runs on the generate thread).
_criteria = []
_therm = _make_thermal_criteria()
if _therm is not None:
_criteria.append(_therm)
if stop:
class StopOnSequence(StoppingCriteria):
def __init__(self, stop_sequences, tokenizer):
self.stop_sequences = stop_sequences
self.tokenizer = tokenizer
def __call__(self, input_ids, scores, **kwargs):
decoded = self.tokenizer.decode(input_ids[0][-20:], skip_special_tokens=True)
return any(seq in decoded for seq in self.stop_sequences)
generation_kwargs["stopping_criteria"] = StoppingCriteriaList([
StopOnSequence(stop, self.tokenizer)
])
_criteria.append(StopOnSequence(stop, self.tokenizer))
if _criteria:
generation_kwargs["stopping_criteria"] = StoppingCriteriaList(_criteria)
generation_error = None
......@@ -890,9 +1025,19 @@ class NvidiaBackend(ModelBackend):
_time.time() - self._kv_timestamp < self._kv_ttl
)
def _model_on_cuda(self) -> bool:
"""Return True only when the model's first parameter is actually on a CUDA device."""
try:
return next(self.model.parameters()).is_cuda
except StopIteration:
return False
def _build_kv_prefix(self, prefix_text: str):
"""Forward-pass on prefix_text to populate the KV state."""
import torch
# KV prefix caching requires CUDA tensors; skip on CPU-mode models.
if not self._model_on_cuda():
raise RuntimeError("KV prefix cache requires CUDA; model is on CPU")
inputs = self.tokenizer(
prefix_text, return_tensors="pt", add_special_tokens=False
)
......@@ -910,6 +1055,8 @@ class NvidiaBackend(ModelBackend):
def invalidate_kv_cache(self) -> None:
"""Discard the cached KV state (call on model unload/swap)."""
self._kv_prefix_text = None
if self._kv_past_key_values is not None:
del self._kv_past_key_values
self._kv_past_key_values = None
self._kv_prefix_len = 0
self._kv_timestamp = 0.0
......@@ -934,8 +1081,67 @@ class NvidiaBackend(ModelBackend):
]
return self.format_messages(chat_msgs)
def _eos_token_ids(self):
"""All token ids that should END generation — including the chat turn
boundary. Qwen's turn ends with <|im_end|>, but tokenizer.eos_token_id is
<|endoftext|>; without im_end the model never stops and hallucinates extra
'assistant'/'user' turns. Returns a list (HF generate accepts a list)."""
ids = set()
try:
if self.tokenizer.eos_token_id is not None:
ids.add(int(self.tokenizer.eos_token_id))
except Exception:
pass
for tok in ('<|im_end|>', '<|eot_id|>', '<|end|>', '<|endoftext|>',
'<|end_of_text|>', '<end_of_turn>'):
try:
tid = self.tokenizer.convert_tokens_to_ids(tok)
if isinstance(tid, int) and tid >= 0 and tid != getattr(
self.tokenizer, 'unk_token_id', None):
ids.add(tid)
except Exception:
pass
return list(ids) if ids else self.tokenizer.eos_token_id
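Review note: the stop-id collection can be exercised with a stub. `StubTokenizer` is a hypothetical tokenizer with a Qwen-like vocabulary and made-up ids, used only for illustration; a real HuggingFace tokenizer may report unknown tokens differently.

```python
class StubTokenizer:
    """Hypothetical tokenizer; the ids are invented for this example."""
    eos_token_id = 0   # plays the role of <|endoftext|>
    unk_token_id = 99
    _vocab = {"<|endoftext|>": 0, "<|im_end|>": 1}

    def convert_tokens_to_ids(self, token: str) -> int:
        return self._vocab.get(token, self.unk_token_id)


def collect_stop_ids(tokenizer, candidates) -> list:
    """Gather eos_token_id plus every candidate token the vocabulary knows."""
    ids = set()
    if tokenizer.eos_token_id is not None:
        ids.add(int(tokenizer.eos_token_id))
    for tok in candidates:
        tid = tokenizer.convert_tokens_to_ids(tok)
        # Unknown tokens map to unk_token_id and must be skipped.
        if isinstance(tid, int) and tid >= 0 and tid != tokenizer.unk_token_id:
            ids.add(tid)
    return sorted(ids)


stop_ids = collect_stop_ids(
    StubTokenizer(), ("<|im_end|>", "<|eot_id|>", "<|endoftext|>"))
```

With this stub, `<|eot_id|>` is unknown and is dropped, while `<|im_end|>` is added next to the regular EOS id.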
def _build_chat_prompt(self, messages, enable_thinking: bool = False,
add_generation_prompt: bool = True) -> str:
"""Build the prompt string using the MODEL's own chat template when it has
one (correct special tokens + proper `enable_thinking` handling for Qwen3).
Falls back to the legacy custom formatter when no template is available.
`enable_thinking=True` keeps reasoning <think> blocks available for callers
that ask for them; `False` (default) suppresses them via the template.
"""
tmpl = getattr(self.tokenizer, 'chat_template', None)
if tmpl:
# Normalise to plain {role, content} dicts for apply_chat_template.
norm = []
for m in messages:
if isinstance(m, dict):
norm.append({'role': m.get('role'), 'content': m.get('content') or ''})
else:
norm.append({'role': getattr(m, 'role', None),
'content': getattr(m, 'content', '') or ''})
try:
return self.tokenizer.apply_chat_template(
norm, tokenize=False,
add_generation_prompt=add_generation_prompt,
enable_thinking=enable_thinking)
except TypeError:
# Tokenizer's template doesn't accept enable_thinking — use plain.
try:
return self.tokenizer.apply_chat_template(
norm, tokenize=False,
add_generation_prompt=add_generation_prompt)
except Exception:
pass
except Exception:
pass
return self._format_messages_to_str(messages)
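Review note: the `TypeError` fallback for templates that do not accept `enable_thinking` can be sketched with a stub. `PlainTemplateTokenizer` and `build_prompt` are hypothetical; the stub has a strict signature so the fallback branch actually triggers, which may not match how a given real tokenizer handles unknown keyword arguments.

```python
class PlainTemplateTokenizer:
    """Hypothetical tokenizer whose template call rejects `enable_thinking`."""
    chat_template = "stub"

    def apply_chat_template(self, messages, tokenize=False,
                            add_generation_prompt=True):
        text = "".join(f"<{m['role']}>{m['content']}" for m in messages)
        return text + ("<assistant>" if add_generation_prompt else "")


def build_prompt(tokenizer, messages, enable_thinking=False) -> str:
    try:
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True,
            enable_thinking=enable_thinking)
    except TypeError:
        # The signature does not accept enable_thinking: retry without it.
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True)


prompt = build_prompt(PlainTemplateTokenizer(),
                      [{"role": "user", "content": "hi"}])
```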
def generate_chat(self, messages, max_tokens=None, temperature=0.7,
top_p=1.0, stop=None, tools=None, response_format=None) -> str:
top_p=1.0, stop=None, tools=None, response_format=None,
enable_thinking=False) -> str:
"""
Non-streaming chat generation with KV prefix caching.
......@@ -947,7 +1153,8 @@ class NvidiaBackend(ModelBackend):
if max_tokens is None:
max_tokens = 512
full_prompt = self._format_messages_to_str(messages)
full_prompt = self._build_chat_prompt(messages, enable_thinking=enable_thinking,
add_generation_prompt=True)
total_input_ids = self.tokenizer(full_prompt, return_tensors="pt")['input_ids']
total_prompt_len = int(total_input_ids.shape[1])
......@@ -961,8 +1168,9 @@ class NvidiaBackend(ModelBackend):
past_kv = None
cached_len = 0
if prefix_msgs:
prefix_text = self._format_messages_to_str(prefix_msgs)
if prefix_msgs and self._model_on_cuda():
prefix_text = self._build_chat_prompt(
prefix_msgs, enable_thinking=enable_thinking, add_generation_prompt=False)
if self._kv_cache_valid() and self._kv_prefix_text == prefix_text:
past_kv = self._kv_past_key_values
cached_len = self._kv_prefix_len
......@@ -981,9 +1189,15 @@ class NvidiaBackend(ModelBackend):
top_p=top_p if do_sample else None,
do_sample=do_sample,
pad_token_id=self.tokenizer.pad_token_id,
eos_token_id=self.tokenizer.eos_token_id,
eos_token_id=self._eos_token_ids(),
use_cache=True,
)
# Mid-generation thermal checkpoint (runs on the generate thread, so it
# pauses GPU work between tokens when the CPU/GPU is too hot).
_therm = _make_thermal_criteria()
if _therm is not None:
from transformers import StoppingCriteriaList
gen_kwargs["stopping_criteria"] = StoppingCriteriaList([_therm])
generated_text = ""
try:
......@@ -1014,7 +1228,12 @@ class NvidiaBackend(ModelBackend):
generated_text = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
except Exception as e:
print(f"Warning: KV-cached generate_chat failed ({e}), retrying without cache")
self.invalidate_kv_cache()
cached_len = 0
# Determine if the error is a CUDA device-placement issue; if so, also
# disable the internal KV cache which accumulates mixed-device tensors.
_is_device_error = "is_cuda" in str(e) or "device" in str(e).lower()
_fallback_kwargs = {**gen_kwargs, 'use_cache': not _is_device_error}
try:
total_input_ids = self.tokenizer(
full_prompt, return_tensors="pt"
......@@ -1024,13 +1243,34 @@ class NvidiaBackend(ModelBackend):
outputs = self.model.generate(
input_ids=total_input_ids,
attention_mask=attn_mask,
**gen_kwargs,
**_fallback_kwargs,
)
new_tokens = outputs[0][total_prompt_len:]
generated_text = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
except Exception as e2:
print(f"Error: generate_chat fallback failed: {e2}")
generated_text = ""
# Last resort: disable internal KV cache entirely
if _fallback_kwargs.get('use_cache', True):
try:
no_cache_kwargs = {**gen_kwargs, 'use_cache': False}
total_input_ids = self.tokenizer(
full_prompt, return_tensors="pt"
)['input_ids'].to(self.model.device)
attn_mask = torch.ones_like(total_input_ids)
with torch.no_grad():
outputs = self.model.generate(
input_ids=total_input_ids,
attention_mask=attn_mask,
**no_cache_kwargs,
)
new_tokens = outputs[0][total_prompt_len:]
generated_text = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
print("generate_chat: recovered with use_cache=False")
except Exception as e3:
print(f"Error: generate_chat no-cache fallback failed: {e3}")
generated_text = ""
else:
generated_text = ""
try:
comp_len = len(self.tokenizer.encode(generated_text)) if generated_text else 0
......@@ -1046,7 +1286,8 @@ class NvidiaBackend(ModelBackend):
async def generate_chat_stream(self, messages, max_tokens=None,
temperature=0.7, top_p=1.0, stop=None,
tools=None, response_format=None):
tools=None, response_format=None,
enable_thinking=False):
"""
Streaming chat generation with KV prefix caching.
Uses the same prefix-cache strategy as generate_chat.
......@@ -1058,7 +1299,8 @@ class NvidiaBackend(ModelBackend):
if max_tokens is None:
max_tokens = 512
full_prompt = self._format_messages_to_str(messages)
full_prompt = self._build_chat_prompt(messages, enable_thinking=enable_thinking,
add_generation_prompt=True)
total_input_ids = self.tokenizer(full_prompt, return_tensors="pt")['input_ids']
total_prompt_len = int(total_input_ids.shape[1])
......@@ -1070,8 +1312,9 @@ class NvidiaBackend(ModelBackend):
past_kv = None
cached_len = 0
if prefix_msgs:
prefix_text = self._format_messages_to_str(prefix_msgs)
if prefix_msgs and self._model_on_cuda():
prefix_text = self._build_chat_prompt(
prefix_msgs, enable_thinking=enable_thinking, add_generation_prompt=False)
if self._kv_cache_valid() and self._kv_prefix_text == prefix_text:
past_kv = self._kv_past_key_values
cached_len = self._kv_prefix_len
......@@ -1109,11 +1352,16 @@ class NvidiaBackend(ModelBackend):
do_sample=do_sample,
streamer=streamer,
pad_token_id=self.tokenizer.pad_token_id,
eos_token_id=self.tokenizer.eos_token_id,
eos_token_id=self._eos_token_ids(),
use_cache=True,
**extra_gen,
)
# Mid-generation thermal checkpoint (runs on the generate thread).
_criteria = []
_therm = _make_thermal_criteria()
if _therm is not None:
_criteria.append(_therm)
if stop:
class _StopOnSeq(StoppingCriteria):
def __init__(self, seqs, tok):
......@@ -1122,9 +1370,9 @@ class NvidiaBackend(ModelBackend):
def __call__(self, input_ids, scores, **kw):
decoded = self.tok.decode(input_ids[0][-20:], skip_special_tokens=True)
return any(s in decoded for s in self.seqs)
gen_kwargs['stopping_criteria'] = StoppingCriteriaList(
[_StopOnSeq(stop, self.tokenizer)]
)
_criteria.append(_StopOnSeq(stop, self.tokenizer))
if _criteria:
gen_kwargs['stopping_criteria'] = StoppingCriteriaList(_criteria)
gen_error = [None]
comp_tokens = [0]
......@@ -1155,21 +1403,206 @@ class NvidiaBackend(ModelBackend):
if gen_error[0]:
print(f"Warning: KV-cached stream generation error: {gen_error[0]}")
self.invalidate_kv_cache()
def get_model_name(self) -> str:
return self.model_name or "unknown"
def cleanup(self) -> None:
import torch
import torch, gc
try:
from codai.api.state import get_global_debug
_dbg = bool(get_global_debug())
except Exception:
_dbg = False
def _vram_gb():
try:
if torch.cuda.is_available():
free, total = torch.cuda.mem_get_info()
return (total - free) / 1e9
except Exception:
pass
return -1.0
def _cuda_param_gb():
tot = 0
try:
for p in self.model.parameters():
if p.data.is_cuda:
tot += p.data.numel() * p.data.element_size()
for b in self.model.buffers():
if b.data.is_cuda:
tot += b.data.numel() * b.data.element_size()
except Exception:
pass
return tot / 1e9
_v0 = _vram_gb()
self.invalidate_kv_cache()
if self.model is not None:
_pg0 = _cuda_param_gb() if _dbg else 0.0
# Record the GPU storage pointers of THIS model's tensors so we can,
# after moving them to CPU, break any lingering external references
# (e.g. accelerate's tied_params_map, which keeps tied embedding /
# lm_head weights alive on the GPU and fragments the allocator so
# empty_cache() can't release the surrounding memory). Scoped by
# data_ptr so we never touch a different (coexisting) model.
_orig_cuda_ptrs = set()
try:
for _p in self.model.parameters():
if _p.data.is_cuda:
_orig_cuda_ptrs.add(_p.data.untyped_storage().data_ptr())
for _b in self.model.buffers():
if _b.data.is_cuda:
_orig_cuda_ptrs.add(_b.data.untyped_storage().data_ptr())
except Exception:
pass
# Strip accelerate dispatch hooks AND their offload bookkeeping, which
# hold references to the original CUDA tensors. Must happen before we
# move tensors, or the hooks keep the GPU copies alive.
try:
from accelerate.hooks import remove_hook_from_submodules
remove_hook_from_submodules(self.model)
except Exception:
pass
# Walk every submodule and move its raw _parameters/_buffers storage to
# CPU directly. This reaches tensors that model.parameters() may skip
# (e.g. when wrapped by accelerate) and does NOT rely on model.to('cpu'),
# which is a silent no-op on dispatched models.
try:
import torch as _t
for _mod in self.model.modules():
for _d in (_mod._parameters, _mod._buffers):
for _name, _t_obj in list(_d.items()):
if _t_obj is None:
continue
try:
if getattr(_t_obj, 'is_cuda', False):
_d[_name] = _t_obj.to('cpu')
# accelerate stores params as nn.Parameter; keep type
elif hasattr(_t_obj, 'data') and getattr(_t_obj.data, 'is_cuda', False):
_t_obj.data = _t_obj.data.to('cpu')
except Exception:
pass
# Drop per-module accelerate hook state that pins CUDA tensors.
for _attr in ('_hf_hook', '_old_forward'):
if hasattr(_mod, _attr):
try:
delattr(_mod, _attr)
except Exception:
pass
except Exception as e:
print(f" cleanup: module-walk move issue: {e}")
for _attr in ('hf_device_map', '_hf_hook'):
try:
if hasattr(self.model, _attr):
delattr(self.model, _attr)
except Exception:
pass
if _dbg:
print(f" cleanup: CUDA param bytes {_pg0:.1f} → {_cuda_param_gb():.1f} GB")
del self.model
del self.tokenizer
self.model = None
# Break lingering references to THIS model's original GPU tensors that
# outlive the model (accelerate tied_params_map lists, stray caches).
# Only tensors whose storage pointer we recorded above are touched, so
# other models loaded alongside are never affected.
if _orig_cuda_ptrs:
try:
broken = 0
for obj in gc.get_objects():
if not (isinstance(obj, torch.Tensor) and obj.is_cuda):
continue
try:
if obj.untyped_storage().data_ptr() not in _orig_cuda_ptrs:
continue
except Exception:
continue
# Null this tensor out of any list/dict that still holds it.
for ref in gc.get_referrers(obj):
try:
if isinstance(ref, list):
for i, it in enumerate(ref):
if it is obj:
ref[i] = None
broken += 1
elif isinstance(ref, dict):
for k, v in list(ref.items()):
if v is obj:
ref[k] = None
broken += 1
except Exception:
pass
if _dbg and broken:
print(f" cleanup: broke {broken} external GPU-tensor reference(s)")
except Exception:
pass
if self.tokenizer is not None:
del self.tokenizer
self.tokenizer = None
if torch.cuda.is_available():
torch.cuda.empty_cache()
# Force Python GC before emptying the CUDA allocator pool so that all
# Python-held tensor references (closures, local vars, etc.) are dropped.
for _ in range(3):
gc.collect()
if torch.cuda.is_available():
torch.cuda.synchronize()
torch.cuda.empty_cache()
torch.cuda.synchronize()
# Release the model's host-side memory back to the OS (and any swap it
# was paged into) so RSS doesn't creep up across model swaps.
try:
from codai.models.manager import _trim_cpu_ram
_trim_cpu_ram()
except Exception:
pass
_v1 = _vram_gb()
if _v0 >= 0 and _v1 >= 0:
print(f" cleanup: freed {_v0 - _v1:.1f} GB VRAM (now {_v1:.1f} GB used)")
if _dbg:
try:
_alloc = torch.cuda.memory_allocated() / 1e9
_resv = torch.cuda.memory_reserved() / 1e9
print(f" cleanup: torch allocated={_alloc:.1f} GB "
f"reserved={_resv:.1f} GB (driver used={_v1:.1f} GB)")
except Exception:
pass
# If a large chunk is still resident, name what's holding CUDA tensors.
if (_v0 - _v1) < 1.0 and _v1 > 2.0:
try:
biggest = []
total = 0.0
seen = set()
for obj in gc.get_objects():
try:
if isinstance(obj, torch.Tensor) and obj.is_cuda:
if id(obj) in seen:
continue
seen.add(id(obj))
gb = obj.numel() * obj.element_size() / 1e9
total += gb
if gb > 0.05:
rtypes = []
for r in gc.get_referrers(obj)[:4]:
rt = type(r).__name__
if rt == 'dict':
try:
rt = f"dict{list(r.keys())[:3]}"
except Exception:
pass
rtypes.append(rt)
biggest.append((gb, tuple(obj.shape), rtypes))
except Exception:
continue
biggest.sort(reverse=True)
print(f" cleanup-leak: {total:.1f} GB still in CUDA tensors; top holders:")
for gb, shape, rtypes in biggest[:6]:
print(f" {gb:.2f} GB shape={shape} referrers={rtypes}")
except Exception as e:
print(f" cleanup-leak scan failed: {e}")
def get_context_size(self) -> int:
"""Return the model's context window size."""
if self.model is not None and hasattr(self.model, 'config'):
......
......@@ -38,6 +38,34 @@ try:
except (ImportError, AttributeError):
_grammar_guided_gen = False
def _make_llama_thermal_criteria():
"""A llama.cpp StoppingCriteriaList that pauses generation while too hot.
llama-cpp-python evaluates stopping criteria synchronously per token inside
create_(chat_)completion, so blocking here pauses the GPU forward pass —
mid-generation thermal protection for the GGUF/Vulkan/llama.cpp backend.
The criterion never stops generation (returns False) and is throttled so it
doesn't read sensors on every token. Returns None if unavailable.
"""
try:
from llama_cpp import StoppingCriteriaList
except Exception:
return None
def _pause(input_ids, logits):
try:
from codai.models.thermal import checkpoint
checkpoint(context="text-gen", throttle_seconds=2.0)
except Exception:
pass
return False
try:
return StoppingCriteriaList([_pause])
except Exception:
return None
try:
from llama_cpp import Llama
from llama_cpp.llama_chat_format import ChatFormatterResponse
......@@ -699,6 +727,7 @@ class VulkanBackend(ModelBackend):
try:
result = self.model.create_completion(
stopping_criteria=_make_llama_thermal_criteria(),
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
......@@ -717,6 +746,7 @@ class VulkanBackend(ModelBackend):
print(f"Warning: Grammar-guided generation failed: {e}, falling back to normal generation")
try:
result = self.model.create_completion(
stopping_criteria=_make_llama_thermal_criteria(),
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
......@@ -803,6 +833,7 @@ class VulkanBackend(ModelBackend):
prompt_len = len(prompt) if isinstance(prompt, str) else 0
for chunk in self.model.create_completion(
stopping_criteria=_make_llama_thermal_criteria(),
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
......@@ -842,6 +873,7 @@ class VulkanBackend(ModelBackend):
prompt_len = len(prompt) if isinstance(prompt, str) else 0
for chunk in self.model.create_completion(
stopping_criteria=_make_llama_thermal_criteria(),
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
......@@ -911,6 +943,7 @@ class VulkanBackend(ModelBackend):
prompt_len = len(prompt)
for chunk in self.model.create_completion(
stopping_criteria=_make_llama_thermal_criteria(),
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
......@@ -937,6 +970,7 @@ class VulkanBackend(ModelBackend):
return {"stream": generate_stream(), "content": ""}
else:
result = self.model.create_completion(
stopping_criteria=_make_llama_thermal_criteria(),
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
......@@ -1052,6 +1086,9 @@ class VulkanBackend(ModelBackend):
kwargs['stop'] = stop
if response_format and response_format.get('type') == 'json_object':
kwargs['response_format'] = {'type': 'json_object'}
_tc = _make_llama_thermal_criteria()
if _tc is not None:
kwargs['stopping_criteria'] = _tc
result = self.model.create_chat_completion(**kwargs)
usage = result.get('usage', {})
......@@ -1077,6 +1114,9 @@ class VulkanBackend(ModelBackend):
)
if stop:
kwargs['stop'] = stop
_tc = _make_llama_thermal_criteria()
if _tc is not None:
kwargs['stopping_criteria'] = _tc
prompt_tokens = 0
completion_tokens = 0
......
......@@ -108,6 +108,24 @@ class ArchiveConfig:
retention: str = "never" # one of: 1h 1d 2d 1w 1m 3m 6m 1y never
@dataclass
class ThermalConfig:
"""Thermal-protection configuration.
Before running a request against a loaded model, wait until CPU/GPU
temperatures are within safe limits so a long sequence of heavy
generations can't overheat the machine and trip its power-off protection.
Thresholds are in degrees Celsius. CPU and GPU can be toggled separately.
"""
cpu_enabled: bool = True
gpu_enabled: bool = True
cpu_high: float = 90.0 # pause when CPU reaches this temperature
cpu_resume: float = 87.0 # resume once CPU drops back to/below this
gpu_high: float = 90.0 # pause when GPU reaches this temperature
gpu_resume: float = 87.0 # resume once GPU drops back to/below this
poll_seconds: float = 5.0 # how often to re-check while cooling down
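The paired `*_high` / `*_resume` thresholds describe a hysteresis band: work pauses at the high mark and resumes only once the reading falls back to the resume mark. A minimal sketch of that state machine follows. It assumes a hypothetical `next_paused` helper and a made-up temperature trace; the real check lives in the thermal module, which is not shown in this diff.

```python
def next_paused(paused: bool, temp_c: float,
                high: float = 90.0, resume: float = 87.0) -> bool:
    """Hypothetical helper: return the new paused state for one sensor reading."""
    if not paused:
        # Running: pause as soon as the reading reaches the high threshold.
        return temp_c >= high
    # Paused: stay paused until the reading drops to or below the resume threshold.
    return temp_c > resume


# Walk a made-up temperature trace (degrees Celsius) through the state machine.
readings = [85.0, 90.0, 89.0, 88.0, 87.0, 86.0]
paused = False
states = []
for temp in readings:
    paused = next_paused(paused, temp)
    states.append(paused)
```

The gap between the two thresholds keeps the loop from flapping between paused and running when the temperature hovers around a single limit.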
@dataclass
class Config:
"""Main configuration class."""
......@@ -120,6 +138,7 @@ class Config:
image: ImageConfig = field(default_factory=ImageConfig)
whisper: WhisperConfig = field(default_factory=WhisperConfig)
archive: ArchiveConfig = field(default_factory=ArchiveConfig)
thermal: ThermalConfig = field(default_factory=ThermalConfig)
broker: BrokerConfig = field(default_factory=BrokerConfig)
system_prompt: Optional[str] = None
tools_closer_prompt: bool = False
......@@ -273,6 +292,7 @@ class ConfigManager:
image=ImageConfig(**config_data.get("image", {})),
whisper=WhisperConfig(**config_data.get("whisper", {})),
archive=ArchiveConfig(**config_data.get("archive", {})),
thermal=ThermalConfig(**config_data.get("thermal", {})),
broker=BrokerConfig(**config_data.get("broker", {})),
system_prompt=config_data.get("system_prompt"),
tools_closer_prompt=config_data.get("tools_closer_prompt", False),
......@@ -382,6 +402,15 @@ class ConfigManager:
"directory": self.config.archive.directory,
"retention": self.config.archive.retention,
},
"thermal": {
"cpu_enabled": self.config.thermal.cpu_enabled,
"gpu_enabled": self.config.thermal.gpu_enabled,
"cpu_high": self.config.thermal.cpu_high,
"cpu_resume": self.config.thermal.cpu_resume,
"gpu_high": self.config.thermal.gpu_high,
"gpu_resume": self.config.thermal.gpu_resume,
"poll_seconds": self.config.thermal.poll_seconds,
},
"broker": {
"enabled": self.config.broker.enabled,
"base_url": self.config.broker.base_url,
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
"""Shared HuggingFace/transformers loading helper.
Translates a coderai per-model configuration (the uniform models.json schema)
into ``from_pretrained`` kwargs so that EVERY transformers-based loader
(spatial, embedding, audio-gen, vision, …) honours the same quantization,
offload, flash-attention and memory settings as the text/image/video loaders.
The configuration is the single source of truth — nothing here reads CLI args.
"""
import os
from typing import Any, Dict, Optional
def _norm(cfg: Optional[Dict[str, Any]]) -> Dict[str, Any]:
"""Return the per-model config dict, unwrapping a forwarded `_raw_cfg`."""
if not cfg:
return {}
raw = cfg.get('_raw_cfg') if isinstance(cfg, dict) else None
merged = dict(cfg)
if isinstance(raw, dict):
# Raw entry fills any key the translated kwargs didn't set.
for k, v in raw.items():
merged.setdefault(k, v)
return merged
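The merge rule in `_norm` can be illustrated with plain dictionaries. This is only a sketch of the `setdefault` behaviour under an assumed input shape; the example config values are invented.

```python
from typing import Any, Dict

# Hypothetical per-model config as a loader might receive it: translated
# kwargs at the top level plus the raw models.json entry under `_raw_cfg`.
cfg: Dict[str, Any] = {
    "precision": "f16",
    "_raw_cfg": {"precision": "bf16", "load_in_4bit": True},
}

merged = dict(cfg)
for key, value in cfg["_raw_cfg"].items():
    # setdefault keeps an existing top-level value and only fills missing keys.
    merged.setdefault(key, value)
```

Here the top-level `precision` wins over the raw entry, while `load_in_4bit` is filled in from `_raw_cfg` because the translated kwargs did not set it.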
def resolve_dtype(cfg: Optional[Dict[str, Any]], default: str = 'bf16'):
"""Resolve torch dtype from the model's `precision` setting."""
import torch
precision = (_norm(cfg).get('precision') or default)
return {
'bf16': torch.bfloat16,
'f16': torch.float16,
'fp16': torch.float16,
'f32': torch.float32,
'fp32': torch.float32,
}.get(precision, torch.bfloat16 if default == 'bf16' else torch.float32)
def build_quantization_config(cfg: Optional[Dict[str, Any]]):
"""Build a transformers BitsAndBytesConfig from the model config, or None."""
c = _norm(cfg)
load_in_4bit = bool(c.get('load_in_4bit', False))
load_in_8bit = bool(c.get('load_in_8bit', False))
if not (load_in_4bit or load_in_8bit):
return None
try:
import torch
from transformers import BitsAndBytesConfig
if load_in_4bit:
return BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_compute_dtype=resolve_dtype(cfg),
bnb_4bit_use_double_quant=True,
)
return BitsAndBytesConfig(load_in_8bit=True)
except Exception as e:
print(f" Quantization requested but unavailable: {e}")
return None
def _is_gguf_value(v) -> bool:
"""True if a component_quantization value points to a GGUF file (path/URL)."""
return isinstance(v, str) and v.strip().lower().endswith('.gguf')
def _normalize_quant_mode(mode) -> Optional[str]:
"""Normalize a quant mode string to '2bit' / '4bit' / '8bit' / None.
2-bit uses the quanto backend (optimum-quanto); 4/8-bit use bitsandbytes.
GGUF file values are handled separately (see build_gguf_pipeline_components).
"""
if mode in (None, '', 'none', 'off', False):
return None
if _is_gguf_value(mode):
return None # GGUF handled elsewhere
m = str(mode).lower().replace('-', '').replace('_', '').replace(' ', '')
if m in ('2bit', '2', 'int2', 'quanto2'):
return '2bit'
if m in ('4bit', '4', 'int4', 'nf4', 'bnb4'):
return '4bit'
if m in ('8bit', '8', 'int8', 'bnb8'):
return '8bit'
return None
def _discover_components(model_name: str) -> Dict[str, Any]:
"""Return {component_name: [library, class_name]} from the pipeline config."""
out: Dict[str, Any] = {}
try:
from diffusers import DiffusionPipeline
for name, spec in DiffusionPipeline.load_config(model_name).items():
if name.startswith('_'):
continue
if isinstance(spec, (list, tuple)) and spec:
out[name] = list(spec)
except Exception:
pass
return out
def build_pipeline_quant_config(model_name: str, cfg: Optional[Dict[str, Any]], dtype):
"""Build a diffusers PipelineQuantizationConfig from a per-model config.
Honours an optional per-component override map ``component_quantization``
(e.g. {"transformer": "4bit", "text_encoder": "8bit", "vae": "none"}).
Supported per-component modes:
- "4bit" / "8bit": bitsandbytes (default backend)
- "2bit": optimum-quanto (int2) — requires `pip install optimum-quanto`
- a "*.gguf" path/URL: handled by build_gguf_pipeline_components, NOT here
When the map is absent it falls back to the global ``load_in_4bit`` /
``load_in_8bit`` flag applied to all heavy components.
Returns ``(quant_config, description)`` or ``(None, '')``.
"""
c = _norm(cfg)
comp_q = c.get('component_quantization') or {}
global_4 = bool(c.get('load_in_4bit', False))
global_8 = bool(c.get('load_in_8bit', False))
if not comp_q and not (global_4 or global_8):
return None, ''
try:
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers import BitsAndBytesConfig as DiffBnb
from transformers import BitsAndBytesConfig as TfBnb
except Exception as e:
print(f" Pipeline quantization unavailable: {e}")
return None, ''
comp_lib = {n: (s[0] if isinstance(s, list) and s else 'diffusers')
for n, s in _discover_components(model_name).items()}
def _is_heavy(name: str) -> bool:
return (name.startswith('transformer') or name == 'unet'
or name.startswith('text_encoder'))
def _quanto_cfg(lib: str):
# optimum-quanto int2 via the diffusers / transformers QuantoConfig.
import importlib.util
_have_quanto = False
try:
_have_quanto = importlib.util.find_spec('optimum.quanto') is not None
except Exception:
_have_quanto = False
if not _have_quanto:
print(" 2-bit requested but optimum-quanto is not installed — "
"run `pip install optimum-quanto`. Skipping (component stays "
"full precision).")
return None
try:
if lib == 'transformers':
from transformers import QuantoConfig as QC
else:
from diffusers import QuantoConfig as QC
return QC(weights='int2')
except Exception as e:
print(f" 2-bit (quanto) unavailable: {e}")
return None
def _mk(lib: str, mode: str):
if mode == '2bit':
return _quanto_cfg(lib)
BnB = TfBnb if lib == 'transformers' else DiffBnb
if mode == '4bit':
return BnB(load_in_4bit=True, bnb_4bit_quant_type='nf4',
bnb_4bit_compute_dtype=dtype, bnb_4bit_use_double_quant=True)
return BnB(load_in_8bit=True)
quant_mapping: Dict[str, Any] = {}
descs = []
if comp_q:
for name, raw_mode in comp_q.items():
mode = _normalize_quant_mode(raw_mode) # GGUF/none → None here
if mode is None:
continue
cfg_obj = _mk(comp_lib.get(name, 'diffusers'), mode)
if cfg_obj is not None:
quant_mapping[name] = cfg_obj
descs.append(f"{name}:{mode}")
else:
mode = '4bit' if global_4 else '8bit'
targets = [n for n in comp_lib if _is_heavy(n)] or \
['transformer', 'transformer_2', 'text_encoder', 'unet']
for name in targets:
cfg_obj = _mk(comp_lib.get(name, 'diffusers'), mode)
if cfg_obj is not None:
quant_mapping[name] = cfg_obj
descs.append(f"{name}:{mode}")
if not quant_mapping:
return None, ''
try:
return PipelineQuantizationConfig(quant_mapping=quant_mapping), ', '.join(descs)
except Exception as e:
print(f" Pipeline quantization build failed: {e}")
return None, ''
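To make the per-component routing easier to follow, here is a compact restatement of how a `component_quantization` map might be sorted into backends. The `classify` helper, the `_ALIASES` table, and the example paths are illustrative assumptions, not part of the module.

```python
from typing import Any, Dict, Optional

# Alias table mirroring the mode spellings accepted above (illustrative name).
_ALIASES = {
    "2bit": {"2bit", "2", "int2", "quanto2"},
    "4bit": {"4bit", "4", "int4", "nf4", "bnb4"},
    "8bit": {"8bit", "8", "int8", "bnb8"},
}


def classify(value: Any) -> Optional[str]:
    """Return 'gguf', '2bit', '4bit', '8bit', or None for one map entry."""
    if isinstance(value, str) and value.strip().lower().endswith(".gguf"):
        return "gguf"  # loaded as a pre-built component, not via bitsandbytes
    if value in (None, "", "none", "off", False):
        return None
    key = str(value).lower().replace("-", "").replace("_", "").replace(" ", "")
    for mode, names in _ALIASES.items():
        if key in names:
            return mode
    return None  # unknown spellings are ignored rather than raising


# Made-up override map; the GGUF path is a placeholder.
component_quantization: Dict[str, Any] = {
    "transformer": "4-bit",
    "transformer_2": "/models/example_q5_k.gguf",
    "text_encoder": "INT8",
    "vae": "none",
}
plan = {name: classify(value) for name, value in component_quantization.items()}
```

Entries that map to `None` stay at full precision, and `'gguf'` entries are the ones left for `build_gguf_pipeline_components`.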
def build_gguf_pipeline_components(model_name: str, cfg: Optional[Dict[str, Any]], dtype):
"""Load pipeline components from GGUF files (Q2_K..Q8_0 — incl. 5/6-bit).
For each ``component_quantization`` entry whose value is a ``*.gguf`` path or
URL, load that component via
``<Class>.from_single_file(..., GGUFQuantizationConfig)`` so it can be passed
to the pipeline's ``from_pretrained`` as a pre-built component (e.g.
``transformer=<model>``). The bit-width (Q5_K, Q6_K, …) is embedded in the
GGUF file itself.
Returns ``(components_dict, description)``; empty dict when none configured.
Only diffusers components (transformer*/unet/vae) are supported here;
GGUF text encoders are uncommon and skipped with a note.
"""
c = _norm(cfg)
comp_q = c.get('component_quantization') or {}
gguf_entries = {n: v for n, v in comp_q.items() if _is_gguf_value(v)}
if not gguf_entries:
return {}, ''
try:
import diffusers
from diffusers import GGUFQuantizationConfig
except Exception as e:
print(f" GGUF components unavailable: {e}")
return {}, ''
specs = _discover_components(model_name) # name -> [library, class_name]
components: Dict[str, Any] = {}
descs = []
for name, path in gguf_entries.items():
spec = specs.get(name)
if not spec or spec[0] != 'diffusers':
print(f" GGUF skip '{name}': only diffusers components supported "
f"(got {spec}).")
continue
cls_name = spec[1] if len(spec) > 1 else None
cls = getattr(diffusers, cls_name, None) if cls_name else None
if cls is None or not hasattr(cls, 'from_single_file'):
print(f" GGUF skip '{name}': no loadable class {cls_name}.")
continue
try:
print(f" Loading GGUF component '{name}' from {path}")
model = cls.from_single_file(
path.strip(),
quantization_config=GGUFQuantizationConfig(compute_dtype=dtype),
torch_dtype=dtype,
)
components[name] = model
descs.append(f"{name}:gguf")
except Exception as e:
print(f" GGUF load failed for '{name}' ({path}): {e}")
return components, ', '.join(descs)
def build_from_pretrained_kwargs(
cfg: Optional[Dict[str, Any]],
*,
default_precision: str = 'bf16',
enable_flash: bool = True,
) -> Dict[str, Any]:
"""Build common ``from_pretrained`` kwargs from a coderai model config.
Honours: load_in_4bit/8bit, flash_attention, offload_strategy, offload_dir,
max_gpu_percent, manual_ram_gb, no_ram, precision.
"""
import torch
c = _norm(cfg)
kwargs: Dict[str, Any] = {
'trust_remote_code': True,
'low_cpu_mem_usage': True,
'torch_dtype': resolve_dtype(cfg, default_precision),
}
# Quantization (transformers BitsAndBytesConfig)
quant = build_quantization_config(cfg)
if quant is not None:
kwargs['quantization_config'] = quant
bits = 4 if c.get('load_in_4bit') else 8
print(f" HF quantization: {bits}-bit (bitsandbytes)")
# Flash attention: honour `flash_attention` (falling back to the legacy
# `flash_attn` key) or either sdcpp flag. The sdcpp flags are no-ops for
# transformers models, so an enabled one still signals the user's intent to
# use Flash-Attention-2 here.
_flash = (c.get('flash_attention', c.get('flash_attn', False))
or c.get('sdcpp_flash_attn', False)
or c.get('sdcpp_diffusion_flash_attn', False))
if enable_flash and _flash:
try:
import flash_attn # noqa: F401
kwargs['attn_implementation'] = 'flash_attention_2'
print(" Flash Attention 2 enabled")
except Exception:
print(" Flash Attention 2 requested but not installed — ignoring")
# Offload / device placement
no_ram = bool(c.get('no_ram', False))
offload_strategy = (c.get('offload_strategy') or 'auto')
max_gpu_percent = c.get('max_gpu_percent')
if not torch.cuda.is_available():
return kwargs # CPU-only host; let transformers place on CPU
if no_ram or offload_strategy == 'none':
# Everything on GPU, no CPU spill.
kwargs['device_map'] = {'': 0}
return kwargs
# Build a max_memory map so large models split GPU → CPU RAM → disk.
try:
import psutil
free_vram, total_vram = torch.cuda.mem_get_info(0)
headroom = 512 * 1024 * 1024
if max_gpu_percent is not None:
gpu_budget = int(total_vram * max(0.0, min(1.0, float(max_gpu_percent) / 100.0)))
gpu_budget = min(gpu_budget, max(0, free_vram - headroom))
else:
gpu_budget = max(0, free_vram - headroom)
manual_ram_gb = c.get('manual_ram_gb')
if manual_ram_gb:
cpu_budget = int(float(manual_ram_gb) * 1e9)
else:
cpu_budget = max(0, psutil.virtual_memory().available - int(4e9))
kwargs['device_map'] = 'auto'
kwargs['max_memory'] = {0: gpu_budget, 'cpu': cpu_budget}
# Disk overflow when offloading is allowed.
offload_dir = c.get('offload_dir') or os.path.join(
os.path.expanduser('~'), '.cache', 'coderai', 'offload')
offload_dir = os.path.expanduser(offload_dir)
os.makedirs(offload_dir, exist_ok=True)
kwargs['offload_folder'] = offload_dir
kwargs['offload_buffers'] = True
except Exception as e:
print(f" Could not build offload map ({e}); loading with device_map=auto")
kwargs['device_map'] = 'auto'
return kwargs
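The GPU side of the `max_memory` map comes down to a small piece of arithmetic. The sketch below pulls it into a pure function so the capping rules are easier to check. The `gpu_budget` name and the VRAM figures are assumptions made for illustration.

```python
from typing import Optional

HEADROOM = 512 * 1024 * 1024  # bytes kept free on the GPU, as in the loader above
GIB = 1024 ** 3


def gpu_budget(free_vram: int, total_vram: int,
               max_gpu_percent: Optional[float]) -> int:
    """Hypothetical pure-function version of the GPU budget arithmetic."""
    available = max(0, free_vram - HEADROOM)
    if max_gpu_percent is None:
        return available
    fraction = max(0.0, min(1.0, float(max_gpu_percent) / 100.0))
    # The percentage cap applies to total VRAM but can never exceed what is free.
    return min(int(total_vram * fraction), available)


# Made-up 24 GiB card in three situations.
capped = gpu_budget(20 * GIB, 24 * GIB, 50)      # percentage cap is the limit
uncapped = gpu_budget(20 * GIB, 24 * GIB, None)  # free VRAM minus headroom
tight = gpu_budget(4 * GIB, 24 * GIB, 50)        # little free VRAM wins over the cap
```

Whatever does not fit in this budget is left for `device_map='auto'` to place in CPU RAM and then in the offload folder.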
def pipeline_device_kwargs(cfg: Optional[Dict[str, Any]]) -> Dict[str, Any]:
"""Return kwargs for HF ``pipeline(...)`` honouring quantization/offload.
HF pipelines accept ``model_kwargs`` (passed to from_pretrained) and a
``device_map``/``torch_dtype``. We funnel the same config through.
"""
base = build_from_pretrained_kwargs(cfg)
pk: Dict[str, Any] = {}
model_kwargs: Dict[str, Any] = {}
for k in ('quantization_config', 'attn_implementation', 'max_memory',
'offload_folder', 'offload_buffers', 'low_cpu_mem_usage',
'trust_remote_code'):
if k in base:
model_kwargs[k] = base[k]
if 'torch_dtype' in base:
pk['torch_dtype'] = base['torch_dtype']
if 'device_map' in base:
pk['device_map'] = base['device_map']
if model_kwargs:
pk['model_kwargs'] = model_kwargs
return pk
......@@ -112,6 +112,10 @@ def default_environments_dir() -> Path:
return ensure_dir(legacy_style_config_dir() / "environments")
def default_loras_dir() -> Path:
return ensure_dir(legacy_style_config_dir() / "loras")
def default_whisper_server_path() -> str:
if os.name == "nt":
local = _windows_dir("LOCALAPPDATA", _home_dir() / "AppData" / "Local")
......
......@@ -20,6 +20,14 @@ from typing import Dict, List, Optional
from pydantic import BaseModel, ConfigDict
class VideoLoraConfig(BaseModel):
"""A LoRA adapter to apply to the video diffusion pipeline for one request."""
model: str # path or HF id of the LoRA weights
weight: float = 1.0
name: Optional[str] = None
model_config = ConfigDict(extra="allow")
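As a rough illustration of the shape this model accepts, the fragment below builds the `loras` portion of a request body. Only the three fields declared above are used; the adapter paths and names are made-up placeholders, and the rest of the request is omitted.

```python
import json

# Fragment of a video generation request body; only the `loras` field is shown.
request_fragment = {
    "loras": [
        # Placeholder path to a per-character identity LoRA.
        {"model": "/path/to/khumalo_identity.safetensors",
         "weight": 0.8,
         "name": "khumalo"},
        # `weight` and `name` are optional and fall back to 1.0 and None.
        {"model": "/path/to/style.safetensors"},
    ]
}
body = json.dumps(request_fragment, indent=2)
```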
class CharacterDialogLine(BaseModel):
"""One spoken line in a multi-character dialog sequence."""
character: Optional[str] = None # character profile name (used for lip-sync face)
......@@ -78,6 +86,10 @@ class VideoGenerationRequest(BaseModel):
# Named saved profiles to load (resolved server-side)
character_profiles: Optional[List[str]] = None
# Per-request LoRA adapters (e.g. trained per-character identity LoRAs).
# Applied to diffusers video pipelines that support load_lora_weights.
loras: Optional[List[VideoLoraConfig]] = None
# ── Audio generation / manipulation ──────────────────────────────────
add_audio: Optional[bool] = False
audio_type: Optional[str] = None # music | speech | sfx | ambient
......
#!/usr/bin/env python3
"""
Township Fighters — Output Review & LoRA Training UI
A lightweight web UI for reviewing generated characters, environments, and
videos, collecting good/bad ratings, and exporting approved images as a LoRA
training dataset.
Usage:
python tools/review_outputs.py [--out-dir ./township_output] [--port 7860]
Then open http://localhost:7860 in your browser.
LoRA export creates:
<out-dir>/lora_dataset/
images/ ← approved images
metadata.jsonl ← caption per image (for dreambooth-style training)
train_lora.sh ← ready-to-run training command
Requirements:
pip install diffusers accelerate peft (for training)
pip install Pillow (for thumbnail generation, usually present)
"""
import argparse
import base64
import http.server
import io
import json
import mimetypes
import os
import shutil
import subprocess
import sys
import threading
import time
import urllib.parse
from pathlib import Path
from typing import Optional
# ─────────────────────────────────────────────────────────────────────────────
# Feedback persistence
# ─────────────────────────────────────────────────────────────────────────────
FEEDBACK_FILE = "feedback.json"
def _feedback_path(out_dir: Path) -> Path:
return out_dir / FEEDBACK_FILE
def load_feedback(out_dir: Path) -> dict:
p = _feedback_path(out_dir)
if p.exists():
try:
return json.loads(p.read_text())
except Exception:
pass
return {"version": 1, "items": {}}
def save_feedback(out_dir: Path, data: dict):
_feedback_path(out_dir).write_text(json.dumps(data, indent=2))
def set_rating(out_dir: Path, rel_path: str, rating: str, note: str = ""):
data = load_feedback(out_dir)
data["items"][rel_path] = {
"rating": rating, # "good" | "bad" | "skip"
"note": note,
"timestamp": int(time.time()),
}
save_feedback(out_dir, data)
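For reference, the persisted `feedback.json` has the shape sketched below. The example writes one rating to a temporary directory and reads it back; the image path is an invented placeholder.

```python
import json
import tempfile
import time
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "feedback.json"
    # Same top-level layout as load_feedback's empty default.
    data = {"version": 1, "items": {}}
    data["items"]["characters/khumalo/ref_00.png"] = {
        "rating": "good",  # "good" | "bad" | "skip"
        "note": "",
        "timestamp": int(time.time()),
    }
    path.write_text(json.dumps(data, indent=2))
    reloaded = json.loads(path.read_text())
```

Keys in `items` are paths relative to the output directory, so the file stays valid if the directory is moved.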
# ─────────────────────────────────────────────────────────────────────────────
# Output discovery
# ─────────────────────────────────────────────────────────────────────────────
def discover_outputs(out_dir: Path) -> dict:
"""Return structured inventory of everything in the output directory."""
inv = {"characters": {}, "environments": {}, "videos": []}
chars_dir = out_dir / "characters"
if chars_dir.exists():
for char in sorted(chars_dir.iterdir()):
if not char.is_dir():
continue
meta_file = char / "meta.json"
meta = {}
if meta_file.exists():
try:
meta = json.loads(meta_file.read_text())
except Exception:
pass
images = sorted(char.glob("ref_*.png")) + sorted(char.glob("ref_*.jpg"))
inv["characters"][char.name] = {
"meta": meta,
"images": [str(p.relative_to(out_dir)) for p in images],
}
envs_dir = out_dir / "environments"
if envs_dir.exists():
for env in sorted(envs_dir.iterdir()):
if not env.is_dir():
continue
meta_file = env / "meta.json"
meta = {}
if meta_file.exists():
try:
meta = json.loads(meta_file.read_text())
except Exception:
pass
images = sorted(env.glob("ref_*.png")) + sorted(env.glob("ref_*.jpg"))
inv["environments"][env.name] = {
"meta": meta,
"images": [str(p.relative_to(out_dir)) for p in images],
}
videos_dir = out_dir / "videos"
if videos_dir.exists():
clips = sorted(videos_dir.glob("*_clip*.mp4"))
finals = [p for p in sorted(videos_dir.glob("*.mp4")) if p not in clips]
inv["videos"] = [str(p.relative_to(out_dir)) for p in finals + clips]
return inv
# ─────────────────────────────────────────────────────────────────────────────
# LoRA training export
# ─────────────────────────────────────────────────────────────────────────────
def export_lora_dataset(out_dir: Path, base_model: Optional[str] = None,
steps: int = 500, lr: str = "1e-4") -> dict:
"""
Collect all "good"-rated images + their prompts and write a
dreambooth-compatible dataset under <out-dir>/lora_dataset/.
Returns a summary dict.
"""
feedback = load_feedback(out_dir)
good = {k: v for k, v in feedback["items"].items()
if v.get("rating") == "good"}
lora_dir = out_dir / "lora_dataset"
imgs_dir = lora_dir / "images"
imgs_dir.mkdir(parents=True, exist_ok=True)
meta_lines = []
copied = 0
skipped = 0
for rel_path, fb in sorted(good.items()):
src = out_dir / rel_path
if not src.exists() or not rel_path.lower().endswith((".png", ".jpg", ".jpeg")):
skipped += 1
continue
# Build a caption from the meta.json stored alongside the image
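parts = Path(rel_path).parts # e.g. ("characters", "khumalo", "ref_00.png")
caption = fb.get("note", "").strip()
# Default name so the fallback caption below never hits an unbound variable
# when rel_path has fewer than two components.
name = parts[1] if len(parts) >= 2 else Path(rel_path).stem
if not caption and len(parts) >= 2:
category = parts[0] # "characters" or "environments"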
# Look for meta.json
meta_path = out_dir / category / name / "meta.json"
if meta_path.exists():
try:
meta = json.loads(meta_path.read_text())
caption = meta.get("prompt", "") or meta.get("description", "")
except Exception:
pass
if not caption:
caption = f"{name.replace('_', ' ')}, African township fighter, cinematic"
dest_name = rel_path.replace("/", "_").replace("\\", "_")
dest = imgs_dir / dest_name
shutil.copy2(src, dest)
meta_lines.append(json.dumps({
"file_name": f"images/{dest_name}",
"text": caption,
}))
copied += 1
if not meta_lines:
return {"ok": False, "error": "No good-rated images found. Rate some images first."}
(lora_dir / "metadata.jsonl").write_text("\n".join(meta_lines) + "\n")
# Detect any existing LoRA to extend
existing_lora = _find_existing_lora(lora_dir)
# Write a ready-to-run training script
model = base_model or "stabilityai/stable-diffusion-xl-base-1.0"
_write_train_script(lora_dir, model, existing_lora, steps=steps, lr=lr)
result = {
"ok": True,
"dataset_dir": str(lora_dir),
"images": copied,
"skipped": skipped,
"train_script": str(lora_dir / "train_lora.sh"),
"metadata": str(lora_dir / "metadata.jsonl"),
}
if existing_lora:
result["extending"] = str(existing_lora)
return result
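Each line of `metadata.jsonl` pairs a flattened image name with its caption. A minimal sketch of one record, assuming a hypothetical `dataset_record` helper and an invented image path:

```python
import json


def dataset_record(rel_path: str, caption: str) -> str:
    """Hypothetical helper: build one metadata.jsonl line for an approved image."""
    # Path separators are flattened so every image lands directly in images/.
    dest_name = rel_path.replace("/", "_").replace("\\", "_")
    return json.dumps({"file_name": f"images/{dest_name}", "text": caption})


line = dataset_record(
    "characters/khumalo/ref_00.png",
    "khumalo, African township fighter, cinematic",
)
record = json.loads(line)
```

Flattening the relative path keeps files from different characters distinct even when they share a base name such as `ref_00.png`.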
def _find_existing_lora(lora_dir: Path) -> Optional[Path]:
"""
Return the most recent LoRA weights to extend from, checking in order:
1. Latest checkpoint-NNNN/ subdirectory inside lora_weights/
2. Most recently modified .safetensors file inside lora_weights/
"""
weights_dir = lora_dir / "lora_weights"
if not weights_dir.exists():
return None
# Prefer the highest-numbered checkpoint directory (resumable mid-training)
checkpoints = sorted(
[d for d in weights_dir.iterdir()
if d.is_dir() and d.name.startswith("checkpoint-")
and d.name.split("-")[-1].isdigit()], # skip non-numeric names such as checkpoint-final
key=lambda d: int(d.name.split("-")[-1])
)
if checkpoints:
return checkpoints[-1]
# Fall back to the most recently modified .safetensors file
safetensors = sorted(
weights_dir.glob("*.safetensors"),
key=lambda p: p.stat().st_mtime,
reverse=True,
)
return safetensors[0] if safetensors else None
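The numeric sort key matters because checkpoint names do not order correctly as plain strings. A small sketch with made-up directory names:

```python
# Invented checkpoint directory names with different digit counts.
names = ["checkpoint-1000", "checkpoint-200", "checkpoint-900"]

# Plain string sorting compares character by character, so "1000" sorts first.
lexicographic_latest = sorted(names)[-1]

# Sorting on the integer suffix picks the checkpoint with the most steps.
numeric_latest = sorted(names, key=lambda n: int(n.split("-")[-1]))[-1]
```

With string ordering the step-900 checkpoint would be treated as the newest, which is why the suffix is converted to an integer before sorting.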
def _write_train_script(lora_dir: Path, base_model: str,
existing_lora: Optional[Path] = None,
steps: int = 500, lr: str = "1e-4"):
"""
Write train_lora.sh.
- If existing_lora is a checkpoint-NNNN/ dir → resume via --resume_from_checkpoint
- If existing_lora is a .safetensors file → initialize LoRA from it, continue training
- If None → fresh LoRA from scratch
"""
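# Use absolute paths: the script may be launched from a different working directory.
lora_dir = lora_dir.resolve()
existing_lora = existing_lora.resolve() if existing_lora else None
weights_dir = lora_dir / "lora_weights"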
if existing_lora and existing_lora.is_dir():
# Mid-training checkpoint: trainer can resume exactly
resume_flag = f'--resume_from_checkpoint="{existing_lora}"'
extend_note = f"# Resuming from checkpoint: {existing_lora}"
init_flag = ""
elif existing_lora and existing_lora.suffix == ".safetensors":
# Completed LoRA: load its adapter weights as starting point.
# The diffusers trainer supports --lora_model_name_or_path for this.
resume_flag = ""
init_flag = f'--lora_model_name_or_path="{existing_lora}" \\'
extend_note = f"# Extending existing LoRA: {existing_lora}"
else:
resume_flag = ""
init_flag = ""
extend_note = "# Fresh LoRA training from scratch"
resume_line = f" {resume_flag} \\\n" if resume_flag else ""
init_line = f" {init_flag}\n" if init_flag else ""
train_sh = f"""#!/bin/bash
# LoRA training script — generated by review_outputs.py
# Requires: pip install diffusers accelerate peft transformers
{extend_note}
DATASET_DIR="{lora_dir}"
OUTPUT_DIR="{weights_dir}"
BASE_MODEL="{base_model}"
mkdir -p "$OUTPUT_DIR"
accelerate launch --mixed_precision="fp16" \\
-m diffusers.scripts.train_dreambooth_lora_sdxl \\
--pretrained_model_name_or_path="$BASE_MODEL" \\
--dataset_name="$DATASET_DIR" \\
--output_dir="$OUTPUT_DIR" \\
--mixed_precision="fp16" \\
--resolution=1024 \\
--train_batch_size=1 \\
--gradient_accumulation_steps=4 \\
--learning_rate={lr} \\
--lr_scheduler="constant" \\
--lr_warmup_steps=0 \\
--max_train_steps={steps} \\
--checkpointing_steps=100 \\
--seed=42 \\
--report_to="none" \\
{resume_line}{init_line}
echo ""
echo "LoRA weights saved to: $OUTPUT_DIR"
echo "To use: add the .safetensors file to your CoderAI model config as a LoRA."
"""
script = lora_dir / "train_lora.sh"
script.write_text(train_sh)
script.chmod(0o755)
def run_lora_training(out_dir: Path, steps: int = 500, lr: str = "1e-4") -> dict:
"""Launch the generated train_lora.sh in the background."""
lora_dir = out_dir / "lora_dataset"
script = lora_dir / "train_lora.sh"
if not script.exists():
return {"ok": False, "error": "No training script found. Export dataset first."}
# Re-generate script with latest steps/lr in case user changed them
existing = _find_existing_lora(lora_dir)
meta = lora_dir / "metadata.jsonl"
if meta.exists():
# Read base model from the existing script if we have one
base_model = "stabilityai/stable-diffusion-xl-base-1.0"
try:
for line in script.read_text().splitlines():
if line.strip().startswith("BASE_MODEL="):
base_model = line.split("=", 1)[1].strip().strip('"')
break
except Exception:
pass
_write_train_script(lora_dir, base_model, existing, steps=steps, lr=lr)
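log_path = lora_dir / "training.log"
# Resolve the script path: cwd changes below, so a relative path would not be found.
log_file = open(log_path, "w")
proc = subprocess.Popen(
["bash", str(script.resolve())],
stdout=log_file,
stderr=subprocess.STDOUT,
cwd=str(out_dir),
)
# The child process holds its own copy of the descriptor, so the parent can close its handle.
log_file.close()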
return {
"ok": True,
"pid": proc.pid,
"log": str(log_path),
"message": f"Training started (PID {proc.pid}). Watch {log_path}",
}
# ─────────────────────────────────────────────────────────────────────────────
# Embedded HTML/JS UI
# ─────────────────────────────────────────────────────────────────────────────
_HTML = r"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Township Fighters — Review</title>
<style>
*{box-sizing:border-box;margin:0;padding:0}
body{font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',sans-serif;background:#111;color:#e0e0e0;min-height:100vh}
header{background:#1a1a1a;border-bottom:1px solid #333;padding:.75rem 1.25rem;display:flex;align-items:center;gap:1rem;position:sticky;top:0;z-index:100}
header h1{font-size:15px;font-weight:700;color:#fff}
.tabs{display:flex;gap:.25rem}
.tab{padding:.35rem .85rem;border-radius:5px;cursor:pointer;font-size:12px;font-weight:600;color:#999;background:transparent;border:1px solid transparent;transition:all .15s}
.tab.active,.tab:hover{background:#2a2a2a;color:#fff;border-color:#444}
.tab.active{background:#6366f1;border-color:#6366f1;color:#fff}
.stats{margin-left:auto;font-size:11px;color:#666}
.stats span{color:#888}
main{padding:1.25rem;max-width:1400px;margin:0 auto}
.section{display:none}.section.active{display:block}
.grid{display:grid;grid-template-columns:repeat(auto-fill,minmax(180px,1fr));gap:1rem}
.card{background:#1c1c1c;border:1px solid #2a2a2a;border-radius:8px;overflow:hidden;transition:border-color .15s}
.card:hover{border-color:#444}
.card.good{border-color:#22c55e}
.card.bad{border-color:#ef4444}
.card.skip{border-color:#f59e0b}
.thumb{width:100%;aspect-ratio:1;object-fit:cover;display:block;cursor:pointer;background:#0a0a0a}
.card-body{padding:.5rem}
.card-name{font-size:11px;color:#aaa;white-space:nowrap;overflow:hidden;text-overflow:ellipsis;margin-bottom:.4rem}
.card-actions{display:flex;gap:.3rem}
.btn{flex:1;padding:.3rem;border-radius:4px;border:none;cursor:pointer;font-size:11px;font-weight:600;transition:opacity .1s}
.btn:hover{opacity:.8}
.btn-good{background:#16a34a;color:#fff}
.btn-bad{background:#dc2626;color:#fff}
.btn-skip{background:#d97706;color:#fff}
.btn-clear{background:#2a2a2a;color:#aaa;font-size:10px}
.note-input{width:100%;margin-top:.35rem;padding:.25rem .4rem;background:#111;border:1px solid #333;border-radius:3px;color:#ccc;font-size:10px;resize:none}
.group-header{font-size:13px;font-weight:700;color:#ddd;margin:1.25rem 0 .6rem;padding-bottom:.3rem;border-bottom:1px solid #2a2a2a}
.video-card{background:#1c1c1c;border:1px solid #2a2a2a;border-radius:8px;overflow:hidden;transition:border-color .15s}
.video-card:hover{border-color:#444}
.video-card.good{border-color:#22c55e}.video-card.bad{border-color:#ef4444}.video-card.skip{border-color:#f59e0b}
.video-card video{width:100%;display:block;max-height:200px;background:#000}
.video-body{padding:.5rem}
.video-name{font-size:11px;color:#aaa;white-space:nowrap;overflow:hidden;text-overflow:ellipsis;margin-bottom:.4rem}
.video-grid{display:grid;grid-template-columns:repeat(auto-fill,minmax(300px,1fr));gap:1rem}
.train-panel{background:#1c1c1c;border:1px solid #2a2a2a;border-radius:8px;padding:1.25rem;max-width:680px}
.train-panel h2{font-size:14px;font-weight:700;margin-bottom:.75rem}
.train-row{display:flex;gap:.75rem;align-items:center;margin-bottom:.75rem}
.train-label{font-size:12px;color:#aaa;width:130px;flex-shrink:0}
.train-input{flex:1;padding:.4rem .6rem;background:#111;border:1px solid #333;border-radius:4px;color:#ddd;font-size:12px}
.action-btn{padding:.55rem 1.2rem;background:#6366f1;color:#fff;border:none;border-radius:5px;cursor:pointer;font-size:13px;font-weight:600}
.action-btn:hover{background:#4f46e5}
.action-btn.danger{background:#dc2626}
.result-box{background:#111;border:1px solid #333;border-radius:5px;padding:.75rem;font-size:12px;color:#aaa;margin-top:.75rem;white-space:pre-wrap;display:none}
.result-box.visible{display:block}
.counter-badge{display:inline-block;padding:.1rem .4rem;border-radius:3px;font-size:10px;font-weight:700;margin-left:.3rem}
.good-badge{background:#16a34a22;color:#4ade80}
.bad-badge{background:#dc262622;color:#f87171}
.skip-badge{background:#d9770622;color:#fbbf24}
.lightbox{display:none;position:fixed;inset:0;background:rgba(0,0,0,.9);z-index:9999;align-items:center;justify-content:center;cursor:zoom-out}
.lightbox.visible{display:flex}
.lightbox img{max-width:90vw;max-height:90vh;object-fit:contain;border-radius:6px}
</style>
</head>
<body>
<div id="lightbox" class="lightbox" onclick="closeLightbox()"><img id="lb-img" src=""></div>
<header>
<h1>Township Fighters — Review</h1>
<div class="tabs">
<div class="tab active" onclick="switchTab('characters',this)">Characters <span id="t-chars" class="counter-badge good-badge"></span></div>
<div class="tab" onclick="switchTab('environments',this)">Environments <span id="t-envs" class="counter-badge good-badge"></span></div>
<div class="tab" onclick="switchTab('videos',this)">Videos <span id="t-vids" class="counter-badge good-badge"></span></div>
<div class="tab" onclick="switchTab('training',this)">Training</div>
</div>
<div class="stats"><span id="stat-good">0</span> good · <span id="stat-bad">0</span> bad · <span id="stat-skip">0</span> maybe</div>
</header>
<main>
<div id="sec-characters" class="section active"></div>
<div id="sec-environments" class="section"></div>
<div id="sec-videos" class="section"></div>
<div id="sec-training" class="section">
<div class="train-panel">
<h2>LoRA Training from Approved Images</h2>
<div id="extend-notice" style="display:none;background:#1a2a1a;border:1px solid #2d4a2d;border-radius:5px;padding:.6rem .8rem;margin-bottom:.8rem;font-size:12px;color:#86efac;line-height:1.5"></div>
<div class="train-row">
<span class="train-label">Base image model</span>
<input id="base-model" class="train-input" value="" placeholder="e.g. John6666/pornmaster-pro-pony-asianponyv3vae-sdxl">
</div>
<div class="train-row">
<span class="train-label">Extra steps</span>
<input id="extra-steps" class="train-input" type="number" value="500" min="50" max="5000"
title="Total new steps to run (added on top of any existing training)">
</div>
<div class="train-row">
<span class="train-label">Learning rate</span>
<input id="lr" class="train-input" value="1e-4" placeholder="1e-4">
</div>
<div style="display:flex;gap:.75rem;flex-wrap:wrap;margin-top:.25rem">
<button class="action-btn" onclick="exportDataset()">1. Export / refresh dataset</button>
<button class="action-btn" onclick="trainLora()">2. Train / extend LoRA</button>
</div>
<div id="train-result" class="result-box"></div>
<div style="margin-top:1rem;font-size:11px;color:#555;line-height:1.7">
<b style="color:#888">First run (fresh LoRA):</b><br>
&nbsp;1. Rate images 👍 Good in Characters / Environments tabs.<br>
&nbsp;2. Click <b>Export dataset</b> → copies approved images + prompts to <code>lora_dataset/</code>.<br>
&nbsp;3. Click <b>Train LoRA</b> → runs training in background, saves <code>lora_weights/*.safetensors</code>.<br>
&nbsp;4. Add the <code>.safetensors</code> to CoderAI's image model LoRA settings.<br>
<br>
<b style="color:#888">Subsequent runs (extending an existing LoRA):</b><br>
&nbsp;1. Generate more content, rate new images as 👍 Good.<br>
&nbsp;2. Click <b>Export dataset</b> → adds new images alongside old ones.<br>
&nbsp;3. Click <b>Train LoRA</b> → auto-detects existing weights and continues from them.<br>
&nbsp;&nbsp;&nbsp;&nbsp;If a <code>checkpoint-NNNN/</code> dir exists → resumes exactly from that point.<br>
&nbsp;&nbsp;&nbsp;&nbsp;If only a <code>.safetensors</code> exists → initializes adapter from it, then trains.<br>
<br>
<b style="color:#888">Requirements:</b> <code>pip install peft accelerate</code>
</div>
</div>
</div>
</main>
<script>
let _inv = {}, _fb = {};
async function api(path, body) {
const opts = body
? {method:'POST', headers:{'Content-Type':'application/json'}, body:JSON.stringify(body)}
: {method:'GET'};
const r = await fetch(path, opts);
return r.json();
}
async function init() {
const data = await api('/api/data');
_inv = data.inventory;
_fb = data.feedback;
renderCharacters();
renderEnvironments();
renderVideos();
updateStats();
}
function switchTab(name, el) {
document.querySelectorAll('.tab').forEach(t => t.classList.remove('active'));
document.querySelectorAll('.section').forEach(s => s.classList.remove('active'));
el.classList.add('active');
document.getElementById('sec-' + name).classList.add('active');
if (name === 'training') checkExistingLora();
}
function rating(relPath) { return (_fb[relPath] || {}).rating || ''; }
function note(relPath) { return (_fb[relPath] || {}).note || ''; }
function updateStats() {
const vals = Object.values(_fb);
document.getElementById('stat-good').textContent = vals.filter(v=>v.rating==='good').length;
document.getElementById('stat-bad').textContent = vals.filter(v=>v.rating==='bad').length;
document.getElementById('stat-skip').textContent = vals.filter(v=>v.rating==='skip').length;
// Update tab badges
const charGood = Object.keys(_fb).filter(k=>k.startsWith('characters/')&&_fb[k].rating==='good').length;
const envGood = Object.keys(_fb).filter(k=>k.startsWith('environments/')&&_fb[k].rating==='good').length;
const vidGood = Object.keys(_fb).filter(k=>k.startsWith('videos/')&&_fb[k].rating==='good').length;
document.getElementById('t-chars').textContent = charGood || '';
document.getElementById('t-envs').textContent = envGood || '';
document.getElementById('t-vids').textContent = vidGood || '';
}
async function rate(relPath, rating, noteEl) {
const n = noteEl ? noteEl.value : (note(relPath) || '');
_fb[relPath] = {rating, note: n, timestamp: Date.now()/1000|0};
await api('/api/rate', {path: relPath, rating, note: n});
// Update card border
document.querySelectorAll(`[data-path="${CSS.escape(relPath)}"]`).forEach(el => {
el.classList.remove('good','bad','skip');
if (rating) el.classList.add(rating);
});
updateStats();
}
async function clearRate(relPath) {
delete _fb[relPath];
await api('/api/rate', {path: relPath, rating: '', note: ''});
document.querySelectorAll(`[data-path="${CSS.escape(relPath)}"]`).forEach(el => {
el.classList.remove('good','bad','skip');
});
updateStats();
}
function imageCard(relPath) {
const r = rating(relPath);
const n = note(relPath);
const fname = relPath.split('/').pop();
const id = relPath.replace(/[^a-z0-9]/gi,'_');
return `
<div class="card ${r}" data-path="${relPath}">
<img class="thumb" src="/file/${relPath}" alt="${fname}"
title="${relPath}" onclick="openLightbox('/file/${relPath}')">
<div class="card-body">
<div class="card-name" title="${relPath}">${fname}</div>
<div class="card-actions">
<button class="btn btn-good" onclick="rate('${relPath}','good', document.getElementById('n_${id}'))">👍</button>
<button class="btn btn-bad" onclick="rate('${relPath}','bad', document.getElementById('n_${id}'))">👎</button>
<button class="btn btn-skip" onclick="rate('${relPath}','skip', document.getElementById('n_${id}'))">🤔</button>
<button class="btn btn-clear" onclick="clearRate('${relPath}')">✕</button>
</div>
<textarea id="n_${id}" class="note-input" rows="2"
placeholder="Optional note…"
onblur="if(_fb['${relPath}'])rate('${relPath}',_fb['${relPath}'].rating,this)"
>${n}</textarea>
</div>
</div>`;
}
function videoCard(relPath) {
const r = rating(relPath);
const n = note(relPath);
const fname = relPath.split('/').pop();
const id = relPath.replace(/[^a-z0-9]/gi,'_');
return `
<div class="video-card ${r}" data-path="${relPath}">
<video controls preload="metadata" src="/file/${relPath}"></video>
<div class="video-body">
<div class="video-name" title="${relPath}">${fname}</div>
<div class="card-actions">
<button class="btn btn-good" onclick="rate('${relPath}','good', document.getElementById('n_${id}'))">👍 Good</button>
<button class="btn btn-bad" onclick="rate('${relPath}','bad', document.getElementById('n_${id}'))">👎 Bad</button>
<button class="btn btn-skip" onclick="rate('${relPath}','skip', document.getElementById('n_${id}'))">🤔 Maybe</button>
<button class="btn btn-clear" onclick="clearRate('${relPath}')">✕</button>
</div>
<textarea id="n_${id}" class="note-input" rows="2"
placeholder="Optional note…"
onblur="if(_fb['${relPath}'])rate('${relPath}',_fb['${relPath}'].rating,this)"
>${n}</textarea>
</div>
</div>`;
}
function renderCharacters() {
let html = '';
for (const [name, data] of Object.entries(_inv.characters || {})) {
const meta = data.meta || {};
const desc = meta.description || '';
html += `<div class="group-header">${name}<span style="font-weight:400;color:#666;font-size:11px;margin-left:.5rem">${desc}</span></div>`;
html += '<div class="grid">';
for (const img of data.images) html += imageCard(img);
html += '</div>';
}
document.getElementById('sec-characters').innerHTML = html || '<div style="color:#555;padding:2rem">No characters found in output directory.</div>';
}
function renderEnvironments() {
let html = '';
for (const [name, data] of Object.entries(_inv.environments || {})) {
const meta = data.meta || {};
const desc = meta.description || '';
html += `<div class="group-header">${name}<span style="font-weight:400;color:#666;font-size:11px;margin-left:.5rem">${desc}</span></div>`;
html += '<div class="grid">';
for (const img of data.images) html += imageCard(img);
html += '</div>';
}
document.getElementById('sec-environments').innerHTML = html || '<div style="color:#555;padding:2rem">No environments found in output directory.</div>';
}
function renderVideos() {
let html = '<div class="video-grid">';
for (const v of (_inv.videos || [])) html += videoCard(v);
html += '</div>';
document.getElementById('sec-videos').innerHTML =
(_inv.videos || []).length ? html : '<div style="color:#555;padding:2rem">No videos found in output directory.</div>';
}
async function exportDataset() {
const baseModel = document.getElementById('base-model').value.trim();
const steps = parseInt(document.getElementById('extra-steps').value) || 500;
const lr = document.getElementById('lr').value.trim() || '1e-4';
const box = document.getElementById('train-result');
box.textContent = 'Exporting dataset…';
box.classList.add('visible');
const r = await api('/api/export', {base_model: baseModel, steps, lr});
if (r.ok) {
let msg = `✓ Dataset ready:\n ${r.images} image(s) → ${r.dataset_dir}\n Script: ${r.train_script}`;
if (r.extending) msg += `\n\n ↪ Will extend existing LoRA:\n ${r.extending}`;
box.textContent = msg;
// Update the notice banner
const notice = document.getElementById('extend-notice');
if (r.extending) {
notice.style.display = '';
notice.innerHTML = `↪ Extending existing LoRA: <code>${r.extending}</code><br>
New approved images will be added on top — previous learning is preserved.`;
} else {
notice.style.display = 'none';
}
} else {
box.textContent = '✗ ' + (r.error || 'Export failed');
}
}
async function trainLora() {
const steps = parseInt(document.getElementById('extra-steps').value) || 500;
const lr = document.getElementById('lr').value.trim() || '1e-4';
const box = document.getElementById('train-result');
box.textContent = 'Starting training…';
box.classList.add('visible');
const r = await api('/api/train', {steps, lr});
box.textContent = r.ok ? `✓ ${r.message}\n\nLog file: ${r.log}` : '✗ ' + (r.error || 'Training failed');
}
// Check for existing LoRA on tab switch to training
async function checkExistingLora() {
const r = await api('/api/lora-status');
const notice = document.getElementById('extend-notice');
if (r.existing_lora) {
notice.style.display = '';
notice.innerHTML = `↪ Existing LoRA detected: <code>${r.existing_lora}</code><br>
Training will continue from these weights, so previous learning is preserved.`;
} else {
notice.style.display = 'none';
}
}
function openLightbox(src) {
document.getElementById('lb-img').src = src;
document.getElementById('lightbox').classList.add('visible');
}
function closeLightbox() {
document.getElementById('lightbox').classList.remove('visible');
}
document.addEventListener('keydown', e => { if (e.key==='Escape') closeLightbox(); });
init();
</script>
</body>
</html>
"""
# ─────────────────────────────────────────────────────────────────────────────
# HTTP server
# ─────────────────────────────────────────────────────────────────────────────
class ReviewHandler(http.server.BaseHTTPRequestHandler):
out_dir: Path = None
def log_message(self, fmt, *args):
pass # suppress per-request access log
def _send(self, code: int, content_type: str, body: bytes):
self.send_response(code)
self.send_header("Content-Type", content_type)
self.send_header("Content-Length", str(len(body)))
self.end_headers()
self.wfile.write(body)
def _json(self, data: dict, code: int = 200):
body = json.dumps(data).encode()
self._send(code, "application/json", body)
def do_GET(self):
parsed = urllib.parse.urlparse(self.path)
path = parsed.path
if path == "/" or path == "":
self._send(200, "text/html; charset=utf-8", _HTML.encode())
return
if path == "/api/data":
inv = discover_outputs(self.out_dir)
fb = load_feedback(self.out_dir)
self._json({"inventory": inv, "feedback": fb["items"]})
return
if path == "/api/lora-status":
existing = _find_existing_lora(self.out_dir / "lora_dataset")
self._json({"existing_lora": str(existing) if existing else None})
return
if path.startswith("/file/"):
rel = urllib.parse.unquote(path[6:])
abs_path = self.out_dir / rel
# Security: resolve the path and ensure it stays inside out_dir
try:
abs_path = abs_path.resolve()
abs_path.relative_to(self.out_dir.resolve())
except Exception:
self._send(403, "text/plain", b"Forbidden")
return
# Only regular files can be served; a directory would make read_bytes() fail
if not abs_path.is_file():
self._send(404, "text/plain", b"Not found")
return
mime = mimetypes.guess_type(str(abs_path))[0] or "application/octet-stream"
self._send(200, mime, abs_path.read_bytes())
return
self._send(404, "text/plain", b"Not found")
def do_POST(self):
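# Reject malformed requests instead of letting the handler raise
try:
length = int(self.headers.get("Content-Length", 0))
body = json.loads(self.rfile.read(length) or b"{}")
except ValueError:
self._json({"ok": False, "error": "Invalid JSON body"}, 400)
return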
path = urllib.parse.urlparse(self.path).path
if path == "/api/rate":
rel = body.get("path", "")
rating = body.get("rating", "")
note = body.get("note", "")
if rating:
set_rating(self.out_dir, rel, rating, note)
else:
# Clear
data = load_feedback(self.out_dir)
data["items"].pop(rel, None)
save_feedback(self.out_dir, data)
self._json({"ok": True})
return
if path == "/api/export":
result = export_lora_dataset(
self.out_dir,
base_model=body.get("base_model") or None,
steps=int(body.get("steps") or 500),
lr=body.get("lr") or "1e-4",
)
self._json(result)
return
if path == "/api/train":
result = run_lora_training(
self.out_dir,
steps=int(body.get("steps") or 500),
lr=body.get("lr") or "1e-4",
)
self._json(result)
return
self._send(404, "text/plain", b"Not found")
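# Illustrative sketch (not part of the tool): the /file/ route above only serves
# paths that still resolve inside out_dir. Assuming the same pathlib approach,
# the containment check could be factored into a standalone helper.
# `resolve_inside` is a hypothetical name used only for this example.

```python
from pathlib import Path
from typing import Optional


def resolve_inside(root: Path, rel: str) -> Optional[Path]:
    """Return the resolved path for `rel` if it stays inside `root`, else None."""
    root = root.resolve()
    candidate = (root / rel).resolve()  # collapses ".." and follows symlinks
    try:
        candidate.relative_to(root)  # raises ValueError when outside root
    except ValueError:
        return None
    return candidate
```

# A relative path such as "../outside.txt" resolves outside the root, so the
# helper returns None and the caller can answer 403.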
# ─────────────────────────────────────────────────────────────────────────────
# Main
# ─────────────────────────────────────────────────────────────────────────────
def main():
parser = argparse.ArgumentParser(
description="Township Fighters — Output Review & LoRA Training UI",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
WHAT THIS TOOL DOES
───────────────────
Serves a web UI (no external dependencies beyond Python stdlib) where you
review everything gen_township_fighters.py produced, rate each item, and
progressively improve generation quality through LoRA fine-tuning.
Characters tab — grid of all fighter reference images, one group per
fighter. Click any image to enlarge it. Rate each one:
👍 Good — include in training data
👎 Bad — exclude (wrong face, bad lighting, etc.)
🤔 Maybe — revisit later
Add an optional text note per image.
Environments tab — same for environment reference images.
Videos tab — inline video player for every generated clip (short,
long, and outcome videos). Rate to track which prompts
and settings produced the best results.
Training tab — build and launch LoRA training from approved images:
Step 1: Export dataset → copies all 👍-rated images +
their generation prompts into lora_dataset/
(dreambooth format with metadata.jsonl).
Step 2: Train LoRA → runs lora_dataset/train_lora.sh
via accelerate in a background process; saves
.safetensors weights to lora_dataset/lora_weights/.
You can also tune Steps (how long to train) and Learning
rate directly in the tab before clicking either button.
FEEDBACK STORAGE
────────────────
Ratings are saved instantly to <out-dir>/feedback.json — no submit button.
The file is plain JSON and can be committed to version control to track
which outputs were acceptable across multiple generation runs.
OUTPUT DIRECTORY STRUCTURE
──────────────────────────
<out-dir>/
characters/<name>/ref_NN.png ← fighter reference images
environments/<name>/ref_NN.png ← location reference images
videos/match_*.mp4 ← fight clips
feedback.json ← your ratings (written by this tool)
lora_dataset/
images/ ← approved images copied here
metadata.jsonl ← one caption per image (generation prompt)
train_lora.sh ← ready-to-run training command
lora_weights/
pytorch_lora_weights.safetensors ← final LoRA weights
checkpoint-100/ ← mid-training checkpoints
checkpoint-200/
LORA TRAINING — FIRST RUN (fresh LoRA from scratch)
────────────────────────────────────────────────────
Requirements: pip install peft accelerate
(diffusers and transformers are already installed by CoderAI)
1. Generate content with gen_township_fighters.py.
2. Open this UI and rate images: 👍 for ones with good likeness/style.
Aim for at least 10-20 good images per subject for decent results.
3. In the Training tab, set the base model to the image model you used
(e.g. John6666/pornmaster-pro-pony-asianponyv3vae-sdxl).
4. Click "Export dataset" — verifies the dataset and shows a summary.
5. Click "Train LoRA" — training runs in the background (~5-15 min on
an RTX 3090 for 500 steps). Watch lora_dataset/training.log.
6. When done, add the .safetensors file to CoderAI:
Admin → Models → configure your image model → LoRA path
LORA TRAINING — SUBSEQUENT RUNS (extending an existing LoRA)
─────────────────────────────────────────────────────────────
After generating more fighters or environments and rating the new images:
1. Click "Export dataset" again — new approved images are added to the
dataset alongside the old ones.
2. The tool auto-detects any existing weights:
• If lora_weights/checkpoint-NNN/ exists → uses --resume_from_checkpoint
(full optimizer state restored; smoothest convergence).
• If only a .safetensors file exists → uses --lora_model_name_or_path
(adapter weights loaded as starting point; optimizer restarts).
3. A green banner in the Training tab confirms what will be extended.
4. Click "Train LoRA" — runs the additional steps on top of what was
already learned. Use a lower learning rate (e.g. 5e-5) for refinement.
5. The updated .safetensors replaces the old one in lora_weights/.
Each round of generate → rate → train improves quality incrementally.
There is no limit to how many rounds you can do.
TUNING TIPS
───────────
Steps 500 = quick first pass; good enough to see improvement
1000 = more thorough; use for final production LoRA
Learning rate
1e-4 = default for a fresh LoRA or large dataset additions
5e-5 = gentler refinement when extending an existing LoRA
1e-5 = very fine correction; use only with a mature LoRA
If generated characters look too generic after training:
→ Add more diverse good images (different angles, lighting)
→ Train for more steps or raise the learning rate slightly on the next run
If generated characters over-fit (all look the same):
→ Fewer steps or a lower learning rate on the next run
→ Drop some near-duplicate images from the dataset
EXAMPLES
────────
# Open the UI for the default output directory:
python tools/review_outputs.py
# Custom output directory:
python tools/review_outputs.py --out-dir /data/township_project
# Different port (useful if 7860 is taken by another tool):
python tools/review_outputs.py --port 8888
# Headless server — don't auto-open the browser:
python tools/review_outputs.py --no-open --port 7860
# Full workflow in one session:
python tools/gen_township_fighters.py --out-dir ./fights # generate
python tools/review_outputs.py --out-dir ./fights # review & train
# Generate more, reusing the existing fighters and environments:
python tools/gen_township_fighters.py --out-dir ./fights \\
--reuse-fighters --reuse-environments --skip-characters \\
--skip-environments
python tools/review_outputs.py --out-dir ./fights # extend LoRA
""",
)
parser.add_argument("--out-dir", default="./township_output", metavar="DIR",
help="Output directory from gen_township_fighters.py (default: ./township_output)")
parser.add_argument("--port", type=int, default=7860, metavar="PORT",
help="Port to serve the UI on (default: 7860)")
parser.add_argument("--no-open", action="store_true",
help="Do not automatically open the browser")
args = parser.parse_args()
out_dir = Path(args.out_dir).resolve()
if not out_dir.exists():
print(f"✗ Output directory not found: {out_dir}")
print(" Run gen_township_fighters.py first to generate content.")
sys.exit(1)
ReviewHandler.out_dir = out_dir
server = http.server.HTTPServer(("0.0.0.0", args.port), ReviewHandler)
inv = discover_outputs(out_dir)
n_chars = sum(len(v["images"]) for v in inv["characters"].values())
n_envs = sum(len(v["images"]) for v in inv["environments"].values())
n_vids = len(inv["videos"])
fb = load_feedback(out_dir)
n_rated = len(fb["items"])
print(f"""
╔══════════════════════════════════════════════════════════╗
║ Township Fighters — Output Review UI ║
╚══════════════════════════════════════════════════════════╝
Output dir : {out_dir}
Content : {n_chars} character images · {n_envs} environment images · {n_vids} videos
Feedback : {n_rated} item(s) already rated
URL : http://localhost:{args.port}
""")
if not args.no_open:
def _open():
time.sleep(0.4)
import webbrowser
webbrowser.open(f"http://localhost:{args.port}")
threading.Thread(target=_open, daemon=True).start()
try:
server.serve_forever()
except KeyboardInterrupt:
print("\n Stopped.")
if __name__ == "__main__":
main()
#!/usr/bin/env python3
"""Dub a video/audio file through CoderAI API while preserving background audio.
The script keeps orchestration, media slicing, timing, mixing, and muxing local.
All AI work is delegated to CoderAI endpoints:
- /v1/audio/transcriptions for dialogue detection/transcription
- /v1/chat/completions for speaker assignment, translation, and metric fitting
- /v1/audio/voices for voice profiles
- /v1/audio/clone for cloned speech generation
- /v1/audio/convert for singing/performance voice conversion when requested
- /v1/audio/stems for optional dialogue/background separation
External tools required locally: ffmpeg and ffprobe.
Python dependency required: requests.
"""
from __future__ import annotations
import argparse
import base64
import dataclasses
import json
import math
import os
import re
import shutil
import subprocess
import sys
import tempfile
import textwrap
import time
import uuid
from pathlib import Path
from typing import Any, Iterable
try:
import requests
except ImportError as exc: # pragma: no cover - user environment check
raise SystemExit("This script requires requests: pip install requests") from exc
DEFAULT_BASE_URL = os.environ.get("CODERAI_BASE_URL", "http://127.0.0.1:8000")
DEFAULT_API_KEY = os.environ.get("CODERAI_API_KEY")
SERVICE_ENV_PREFIXES = {
"transcribe": "CODERAI_TRANSCRIBE",
"text": "CODERAI_TEXT",
"voice": "CODERAI_VOICE",
"convert": "CODERAI_CONVERT",
"stems": "CODERAI_STEMS",
}
AUDIO_EXTS = {".wav", ".mp3", ".m4a", ".aac", ".flac", ".ogg", ".opus", ".webm"}
VIDEO_EXTS = {".mp4", ".mkv", ".mov", ".avi", ".webm", ".m4v"}
SRT_TIME_RE = re.compile(
r"(?P<h>\d{2}):(?P<m>\d{2}):(?P<s>\d{2})[,.](?P<ms>\d{1,3})\s*-->\s*"
r"(?P<eh>\d{2}):(?P<em>\d{2}):(?P<es>\d{2})[,.](?P<ems>\d{1,3})"
)
@dataclasses.dataclass
class Segment:
index: int
start: float
end: float
text: str
speaker: str = "speaker_01"
translated: str = ""
is_singing: bool = False
voice_name: str = ""
ref_audio: Path | None = None
generated_audio: Path | None = None
@property
def duration(self) -> float:
return max(0.05, self.end - self.start)
class CoderAIClient:
def __init__(self, base_url: str, api_key: str | None = None, timeout: int = 7200):
self.base_url = base_url.rstrip("/")
self.timeout = timeout
self.session = requests.Session()
if api_key:
self.session.headers["Authorization"] = f"Bearer {api_key}"
def _post_json(self, path: str, body: dict[str, Any]) -> dict[str, Any]:
response = self.session.post(f"{self.base_url}{path}", json=body, timeout=self.timeout)
if not response.ok:
raise RuntimeError(f"POST {path} failed: {response.status_code} {response.text[:800]}")
return response.json()
def _post_multipart(self, path: str, data: dict[str, Any], files: dict[str, Any]) -> Any:
response = self.session.post(f"{self.base_url}{path}", data=data, files=files, timeout=self.timeout)
if not response.ok:
raise RuntimeError(f"POST {path} failed: {response.status_code} {response.text[:800]}")
return response
def list_models(self) -> list[dict[str, Any]]:
response = self.session.get(f"{self.base_url}/v1/models", timeout=60)
if not response.ok:
return []
return response.json().get("data", [])
def transcribe(self, audio_path: Path, model: str, language: str | None) -> list[Segment]:
with audio_path.open("rb") as handle:
response = self._post_multipart(
"/v1/audio/transcriptions",
{
"model": model,
"language": language or "",
"response_format": "srt",
"temperature": "0",
},
{"file": (audio_path.name, handle, "application/octet-stream")},
)
return parse_srt(response.text)
def chat_json(self, model: str, system: str, user: str, max_tokens: int = 4096) -> Any:
data = self._post_json(
"/v1/chat/completions",
{
"model": model,
"messages": [
{"role": "system", "content": system},
{"role": "user", "content": user},
],
"temperature": 0.2,
"max_tokens": max_tokens,
},
)
content = (data.get("choices") or [{}])[0].get("message", {}).get("content", "")
return extract_json(content)
def create_voice(self, name: str, audio_path: Path, transcript: str, description: str) -> None:
with audio_path.open("rb") as handle:
response = self.session.post(
f"{self.base_url}/v1/audio/voices",
data={"name": name, "transcript": transcript, "description": description},
files={"audio": (audio_path.name, handle, "audio/wav")},
timeout=self.timeout,
)
if response.status_code == 400 and "already" in response.text.lower():
return
if not response.ok:
raise RuntimeError(f"Create voice {name} failed: {response.status_code} {response.text[:800]}")
def clone_voice(self, voice_name: str, text: str, speed: float, out_path: Path) -> None:
data = self._post_json(
"/v1/audio/clone",
{"voice_name": voice_name, "text": text, "speed": speed, "response_format": "b64_wav"},
)
item = (data.get("data") or [{}])[0]
write_api_audio_item(item, out_path, self.session)
def convert_voice(
self,
source_audio: Path,
voice_name: str | None,
out_path: Path,
target_voice: Path | None = None,
f0_condition: bool = True,
length_adjust: float = 1.0,
) -> None:
body: dict[str, Any] = {
"source_audio": file_data_uri(source_audio, "audio/wav"),
"f0_condition": f0_condition,
"length_adjust": length_adjust,
"response_format": "b64_wav",
}
if target_voice is not None:
body["target_voice"] = file_data_uri(target_voice, "audio/wav")
elif voice_name:
body["voice_name"] = voice_name
else:
raise RuntimeError("Voice conversion requires voice_name or target_voice")
data = self._post_json(
"/v1/audio/convert",
body,
)
item = (data.get("data") or [{}])[0]
write_api_audio_item(item, out_path, self.session)
def separate_stems(self, audio_path: Path, workdir: Path, fallback: bool) -> tuple[Path, Path] | None:
data = self._post_json(
"/v1/audio/stems",
{
"audio": file_data_uri(audio_path, "audio/wav"),
"stem_mode": "vocals-instrumental",
"fallback_mode": fallback,
"response_format": "b64_wav",
},
)
vocals = None
instrumental = None
for item in data.get("data", []):
target = workdir / f"stem_{item.get('name', uuid.uuid4().hex)}.wav"
write_api_audio_item(item, target, self.session)
role = (item.get("role") or item.get("name") or "").lower()
if "vocal" in role:
vocals = target
if "instrument" in role or "backing" in role:
instrumental = target
if vocals and instrumental:
return vocals, instrumental
return None
@dataclasses.dataclass(frozen=True)
class CoderAIClients:
default: CoderAIClient
transcribe: CoderAIClient
text: CoderAIClient
voice: CoderAIClient
convert: CoderAIClient
stems: CoderAIClient
def run(cmd: list[str], *, timeout: int | None = None) -> None:
proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, timeout=timeout)
if proc.returncode != 0:
rendered = " ".join(cmd)
detail = proc.stderr.strip() or proc.stdout.strip() or "command failed"
raise RuntimeError(f"{rendered}\n{detail[:2000]}")
def require_binary(name: str) -> str:
path = shutil.which(name)
if not path:
raise SystemExit(f"Required binary not found: {name}")
return path
def media_duration(path: Path) -> float:
proc = subprocess.run(
["ffprobe", "-v", "error", "-show_entries", "format=duration", "-of", "json", str(path)],
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
text=True,
)
if proc.returncode != 0:
raise RuntimeError(proc.stderr.strip())
return float(json.loads(proc.stdout)["format"]["duration"])
def is_video(path: Path) -> bool:
if path.suffix.lower() in VIDEO_EXTS:
return True
if path.suffix.lower() in AUDIO_EXTS:
return False
proc = subprocess.run(
["ffprobe", "-v", "error", "-select_streams", "v:0", "-show_entries", "stream=codec_type", "-of", "csv=p=0", str(path)],
stdout=subprocess.PIPE,
stderr=subprocess.DEVNULL,
text=True,
)
return "video" in proc.stdout
def extract_audio(input_path: Path, output_path: Path) -> None:
run(["ffmpeg", "-y", "-i", str(input_path), "-vn", "-ac", "2", "-ar", "44100", "-c:a", "pcm_s16le", str(output_path)])
def slice_audio(input_path: Path, start: float, end: float, output_path: Path) -> None:
run([
"ffmpeg",
"-y",
"-ss",
f"{start:.3f}",
"-to",
f"{end:.3f}",
"-i",
str(input_path),
"-ac",
"1",
"-ar",
"22050",
"-c:a",
"pcm_s16le",
str(output_path),
])
def adjust_audio_timing(input_path: Path, target_duration: float, output_path: Path, max_stretch: float) -> None:
source_duration = media_duration(input_path)
if source_duration <= 0:
raise RuntimeError(f"Invalid generated audio duration for {input_path}")
ratio = source_duration / target_duration
ratio = min(max(ratio, 1.0 / max_stretch), max_stretch)
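filters = []
# Duration after tempo adjustment; it only changes when atempo is actually applied.
effective_duration = source_duration
if abs(ratio - 1.0) > 0.03:
filters.append(atempo_chain(ratio))
effective_duration = source_duration / ratio
if effective_duration < target_duration:
filters.append(f"apad=whole_dur={target_duration:.3f}")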
filter_arg = ",".join(filters) if filters else "anull"
run([
"ffmpeg",
"-y",
"-i",
str(input_path),
"-af",
filter_arg,
"-t",
f"{target_duration:.3f}",
"-ac",
"2",
"-ar",
"44100",
"-c:a",
"pcm_s16le",
str(output_path),
])
def atempo_chain(ratio: float) -> str:
parts: list[str] = []
remaining = ratio
while remaining > 2.0:
parts.append("atempo=2.0")
remaining /= 2.0
while remaining < 0.5:
parts.append("atempo=0.5")
remaining /= 0.5
parts.append(f"atempo={remaining:.6f}")
return ",".join(parts)
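# Illustrative sketch (not part of the tool): atempo_chain() above splits a tempo
# ratio into factors between 0.5 and 2.0, so the factors in the chain multiply
# back to the requested ratio. The copy of the function below mirrors the
# original so the example runs on its own; `chain_product` is a hypothetical
# helper added only to check that property.

```python
import math


def atempo_chain(ratio: float) -> str:
    """Build a comma-separated atempo filter chain whose factors multiply to `ratio`."""
    parts = []
    remaining = ratio
    while remaining > 2.0:
        parts.append("atempo=2.0")
        remaining /= 2.0
    while remaining < 0.5:
        parts.append("atempo=0.5")
        remaining /= 0.5
    parts.append(f"atempo={remaining:.6f}")
    return ",".join(parts)


def chain_product(chain: str) -> float:
    """Multiply the numeric factors of an atempo chain back together."""
    return math.prod(float(part.split("=")[1]) for part in chain.split(","))


if __name__ == "__main__":
    for ratio in (0.3, 1.25, 5.0):
        chain = atempo_chain(ratio)
        print(ratio, chain, round(chain_product(chain), 6))
```

# For a ratio of 5.0 the chain is two atempo=2.0 stages followed by a 1.25 stage.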
def build_dub_track(segments: list[Segment], duration: float, out_path: Path, workdir: Path) -> None:
silence = workdir / "silence.wav"
run([
"ffmpeg",
"-y",
"-f",
"lavfi",
"-i",
"anullsrc=channel_layout=stereo:sample_rate=44100",
"-t",
f"{duration:.3f}",
"-c:a",
"pcm_s16le",
str(silence),
])
inputs = ["-i", str(silence)]
filter_parts = []
mix_inputs = ["[0:a]"]
input_index = 1
for segment in segments:
if not segment.generated_audio:
continue
inputs.extend(["-i", str(segment.generated_audio)])
delay_ms = max(0, int(round(segment.start * 1000)))
filter_parts.append(
f"[{input_index}:a]adelay={delay_ms}|{delay_ms},volume=1.0[d{input_index}]"
)
mix_inputs.append(f"[d{input_index}]")
input_index += 1
if len(mix_inputs) == 1:
run(["ffmpeg", "-y", "-i", str(silence), "-c:a", "pcm_s16le", str(out_path)])
return
filter_parts.append(f"{''.join(mix_inputs)}amix=inputs={len(mix_inputs)}:duration=longest:normalize=0[out]")
run(["ffmpeg", "-y", *inputs, "-filter_complex", ";".join(filter_parts), "-map", "[out]", "-t", f"{duration:.3f}", str(out_path)])
def duck_background(original_audio: Path, segments: list[Segment], out_path: Path, workdir: Path, duck_db: float) -> None:
volume = 10 ** (duck_db / 20.0)
mask = workdir / "dialogue_mask.wav"
duration = media_duration(original_audio)
silence = workdir / "mask_silence.wav"
run([
"ffmpeg",
"-y",
"-f",
"lavfi",
"-i",
"anullsrc=channel_layout=mono:sample_rate=44100",
"-t",
f"{duration:.3f}",
"-c:a",
"pcm_s16le",
str(silence),
])
tone_inputs = ["-i", str(silence)]
filter_parts = []
mix_inputs = ["[0:a]"]
for i, segment in enumerate(segments, 1):
tone = workdir / f"mask_{i:04d}.wav"
run([
"ffmpeg",
"-y",
"-f",
"lavfi",
"-i",
"aevalsrc=1:s=44100",
"-t",
f"{segment.duration:.3f}",
str(tone),
])
tone_inputs.extend(["-i", str(tone)])
delay_ms = max(0, int(round(segment.start * 1000)))
filter_parts.append(f"[{i}:a]adelay={delay_ms}|{delay_ms}[m{i}]")
mix_inputs.append(f"[m{i}]")
filter_parts.append(f"{''.join(mix_inputs)}amix=inputs={len(mix_inputs)}:duration=longest:normalize=0,alimiter=limit=1[mask]")
run(["ffmpeg", "-y", *tone_inputs, "-filter_complex", ";".join(filter_parts), "-map", "[mask]", "-t", f"{duration:.3f}", str(mask)])
run([
"ffmpeg",
"-y",
"-i",
str(original_audio),
"-i",
str(mask),
"-filter_complex",
# sidechaincompress accepts ratios from 1 to 20, so clamp the derived value
f"[0:a][1:a]sidechaincompress=threshold=0.01:ratio={min(20.0, max(1.0, 1 / max(volume, 0.001))):.3f}:attack=20:release=250[out]",
"-map",
"[out]",
"-c:a",
"pcm_s16le",
str(out_path),
])


def mix_audio(background: Path, dubbed: Path, out_path: Path, duration: float) -> None:
    run([
        "ffmpeg",
        "-y",
        "-i",
        str(background),
        "-i",
        str(dubbed),
        "-filter_complex",
        "[0:a][1:a]amix=inputs=2:duration=longest:normalize=0,loudnorm=I=-16:TP=-1.5:LRA=11[out]",
        "-map",
        "[out]",
        "-t",
        f"{duration:.3f}",
        "-c:a",
        "aac",
        "-b:a",
        "192k",
        str(out_path),
    ])


def mux_output(input_path: Path, final_audio: Path, output_path: Path, video_input: bool) -> None:
    if video_input:
        run([
            "ffmpeg",
            "-y",
            "-i",
            str(input_path),
            "-i",
            str(final_audio),
            "-map",
            "0:v:0",
            "-map",
            "1:a:0",
            "-c:v",
            "copy",
            "-c:a",
            "aac",
            "-shortest",
            str(output_path),
        ])
    else:
        run(["ffmpeg", "-y", "-i", str(final_audio), "-c:a", "aac", str(output_path)])


def parse_srt(text: str) -> list[Segment]:
    """Parse SRT text into segments, skipping cues without a timing line or body."""
    blocks = re.split(r"\n\s*\n", text.strip())
    segments: list[Segment] = []
    for block in blocks:
        lines = [line.strip("\ufeff ") for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        time_line_index = next((i for i, line in enumerate(lines) if "-->" in line), -1)
        if time_line_index < 0:
            continue
        match = SRT_TIME_RE.search(lines[time_line_index])
        if not match:
            continue
        body = " ".join(lines[time_line_index + 1 :]).strip()
        if not body:
            continue
        # Use the cue number when present and numeric, else the running count.
        index_text = lines[0] if time_line_index > 0 else str(len(segments) + 1)
        try:
            index = int(index_text)
        except ValueError:
            index = len(segments) + 1
        segments.append(
            Segment(
                index=index,
                start=srt_time_to_seconds(match.group("h"), match.group("m"), match.group("s"), match.group("ms")),
                end=srt_time_to_seconds(match.group("eh"), match.group("em"), match.group("es"), match.group("ems")),
                text=body,
            )
        )
    return segments


def srt_time_to_seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms.ljust(3, "0")[:3]) / 1000.0


def extract_json(text: str) -> Any:
    """Parse JSON from a chat reply, tolerating <think> blocks, code fences, and surrounding prose."""
    cleaned = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL | re.IGNORECASE).strip()
    if cleaned.startswith("```"):
        cleaned = re.sub(r"^```(?:json)?\s*", "", cleaned)
        cleaned = re.sub(r"\s*```$", "", cleaned)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} or [...] span in the reply.
        start = min((p for p in [cleaned.find("{"), cleaned.find("[")] if p >= 0), default=-1)
        end = max(cleaned.rfind("}"), cleaned.rfind("]"))
        if start >= 0 and end > start:
            return json.loads(cleaned[start : end + 1])
    raise RuntimeError(f"CoderAI chat did not return JSON:\n{text[:1000]}")


def file_data_uri(path: Path, mime: str) -> str:
    return f"data:{mime};base64," + base64.b64encode(path.read_bytes()).decode("ascii")


def write_api_audio_item(item: dict[str, Any], out_path: Path, session: requests.Session) -> None:
    """Write the audio carried by an API response item, inline base64 first, then URL."""
    for key in ("b64_wav", "b64_mp3", "b64_audio", "audio"):
        if item.get(key):
            raw = item[key]
            if isinstance(raw, str) and raw.startswith("data:"):
                raw = raw.split(",", 1)[1]
            out_path.write_bytes(base64.b64decode(raw))
            return
    if item.get("url"):
        response = session.get(item["url"], timeout=7200)
        response.raise_for_status()
        out_path.write_bytes(response.content)
        return
    raise RuntimeError(f"No audio payload found in API response item: {list(item.keys())}")


def choose_default_model(models: list[dict[str, Any]], capability: str) -> str | None:
    for model in models:
        if capability in (model.get("capabilities") or []):
            return model.get("id")
    return None


def env_default(service: str, field: str, fallback: str | None = None) -> str | None:
    prefix = SERVICE_ENV_PREFIXES[service]
    return os.environ.get(f"{prefix}_{field}") or fallback


def build_clients(args: argparse.Namespace) -> CoderAIClients:
    default = CoderAIClient(args.base_url, args.api_key)

    def service_client(service: str) -> CoderAIClient:
        base_url = getattr(args, f"{service}_base_url") or args.base_url
        api_key = getattr(args, f"{service}_api_key")
        if api_key is None:
            api_key = args.api_key
        return CoderAIClient(base_url, api_key)

    return CoderAIClients(
        default=default,
        transcribe=service_client("transcribe"),
        text=service_client("text"),
        voice=service_client("voice"),
        convert=service_client("convert"),
        stems=service_client("stems"),
    )


def client_label(client: CoderAIClient) -> str:
    return client.base_url


def assign_speakers(client: CoderAIClient, text_model: str, segments: list[Segment], max_speakers: int) -> None:
    payload = [
        {"id": s.index, "start": round(s.start, 3), "end": round(s.end, 3), "text": s.text}
        for s in segments
    ]
    system = "You assign dialogue subtitle segments to recurring speakers. Return only JSON."
    user = textwrap.dedent(
        f"""
        Assign each segment to one of at most {max_speakers} stable speaker ids.
        Use ids like speaker_01, speaker_02. Mark singing=true when the segment appears sung, lyrical, chanted, or is likely part of music.
        Return JSON as: {{"segments":[{{"id":1,"speaker":"speaker_01","singing":false}}]}}
        Segments:
        {json.dumps(payload, ensure_ascii=False)}
        """
    ).strip()
    try:
        data = client.chat_json(text_model, system, user, max_tokens=4096)
        by_id = {int(item["id"]): item for item in data.get("segments", [])}
        for segment in segments:
            item = by_id.get(segment.index, {})
            segment.speaker = sanitize_name(str(item.get("speaker") or segment.speaker))
            segment.is_singing = bool(item.get("singing", False))
    except Exception as exc:
        print(f"warning: speaker assignment failed, using automatic speakers: {exc}", file=sys.stderr)
        for i, segment in enumerate(segments):
            segment.speaker = f"speaker_{(i % max(1, max_speakers)) + 1:02d}"


def translate_segments(client: CoderAIClient, text_model: str, target_language: str, segments: list[Segment]) -> None:
    batch_size = 40
    system = "You translate dubbing scripts. Return only JSON."
    for start in range(0, len(segments), batch_size):
        batch = segments[start : start + batch_size]
        payload = [
            {
                "id": s.index,
                "source_text": s.text,
                "duration_seconds": round(s.duration, 3),
                "speaker": s.speaker,
                "singing": s.is_singing,
            }
            for s in batch
        ]
        user = textwrap.dedent(
            f"""
            Translate each segment to {target_language} for dubbing.
            Preserve meaning, tone, speaker intent, and song lyric style when singing=true.
            Keep the translation speakable within the provided duration. Prefer natural lip-sync/metric fit over literal word order.
            Return JSON as: {{"segments":[{{"id":1,"translation":"..."}}]}}
            Segments:
            {json.dumps(payload, ensure_ascii=False)}
            """
        ).strip()
        data = client.chat_json(text_model, system, user, max_tokens=4096)
        by_id = {int(item["id"]): str(item.get("translation", "")).strip() for item in data.get("segments", [])}
        for segment in batch:
            segment.translated = by_id.get(segment.index) or segment.text


def fit_translation_metric(client: CoderAIClient, text_model: str, target_language: str, segments: list[Segment]) -> None:
    system = "You adapt translated lines for dubbing timing and lip-sync. Return only JSON."
    for segment in segments:
        syllable_hint = max(2, int(segment.duration * 4.2))
        user = textwrap.dedent(
            f"""
            Rewrite this {target_language} dub line so it fits about {segment.duration:.2f} seconds.
            Aim for roughly {syllable_hint} syllables, preserve meaning, and keep it natural.
            If singing is true, keep lyric rhythm and rhyme when possible.
            Return JSON as: {{"translation":"..."}}
            Original: {segment.text}
            Current translation: {segment.translated}
            Singing: {segment.is_singing}
            """
        ).strip()
        try:
            data = client.chat_json(text_model, system, user, max_tokens=512)
            value = str(data.get("translation", "")).strip()
            if value:
                segment.translated = value
        except Exception as exc:
            print(f"warning: metric fitting failed for segment {segment.index}: {exc}", file=sys.stderr)


def sanitize_name(value: str) -> str:
    cleaned = re.sub(r"[^a-zA-Z0-9_-]+", "_", value.strip().lower())
    return cleaned[:48] or "speaker_01"


def create_voice_profiles(client: CoderAIClient, source_audio: Path, segments: list[Segment], workdir: Path, prefix: str) -> None:
    by_speaker: dict[str, list[Segment]] = {}
    for segment in segments:
        by_speaker.setdefault(segment.speaker, []).append(segment)
    for speaker, speaker_segments in by_speaker.items():
        # Only the longest segment is used as the reference clip; the other
        # top candidates in `selected` are currently unused.
        selected = sorted(speaker_segments, key=lambda s: s.duration, reverse=True)[:6]
        start = max(0.0, selected[0].start - 0.05)
        end = selected[0].end + 0.05
        ref_audio = workdir / f"ref_{speaker}.wav"
        slice_audio(source_audio, start, end, ref_audio)
        transcript = selected[0].text.strip()
        voice_name = sanitize_name(f"{prefix}_{speaker}")
        print(f"creating voice profile {voice_name} from {start:.2f}-{end:.2f}s")
        client.create_voice(voice_name, ref_audio, transcript, f"Auto-extracted by tools/video_dubber.py for {speaker}")
        for segment in speaker_segments:
            segment.voice_name = voice_name
            segment.ref_audio = ref_audio


def generate_segment_audio(
    voice_client: CoderAIClient,
    convert_client: CoderAIClient,
    source_audio: Path,
    segments: list[Segment],
    workdir: Path,
    max_stretch: float,
    preserve_singing: bool,
) -> None:
    for n, segment in enumerate(segments, 1):
        raw = workdir / f"dub_raw_{segment.index:04d}.wav"
        fitted = workdir / f"dub_fit_{segment.index:04d}.wav"
        speed = 1.0
        if segment.translated:
            # Speed up dense lines (over 18 characters per second), capped at 1.35x.
            approx_chars_per_sec = len(segment.translated) / segment.duration
            if approx_chars_per_sec > 18:
                speed = min(1.35, approx_chars_per_sec / 16)
        print(f"[{n}/{len(segments)}] generating {segment.voice_name} {segment.duration:.2f}s")
        if preserve_singing and segment.is_singing:
            source_slice = workdir / f"sing_source_{segment.index:04d}.wav"
            slice_audio(source_audio, segment.start, segment.end, source_slice)
            try:
                convert_client.convert_voice(
                    source_slice,
                    segment.voice_name,
                    raw,
                    target_voice=segment.ref_audio,
                    f0_condition=True,
                    length_adjust=1.0,
                )
            except Exception as exc:
                print(f"warning: singing conversion failed for segment {segment.index}, falling back to cloned TTS: {exc}", file=sys.stderr)
                voice_client.clone_voice(segment.voice_name, segment.translated or segment.text, speed, raw)
        else:
            voice_client.clone_voice(segment.voice_name, segment.translated or segment.text, speed, raw)
        adjust_audio_timing(raw, segment.duration, fitted, max_stretch)
        segment.generated_audio = fitted


def write_artifacts(segments: list[Segment], output_base: Path) -> None:
    json_path = output_base.with_suffix(".segments.json")
    srt_path = output_base.with_suffix(".translated.srt")
    json_path.write_text(
        json.dumps([dataclasses.asdict(s) | {"ref_audio": str(s.ref_audio or ""), "generated_audio": str(s.generated_audio or "")} for s in segments], indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    lines = []
    for i, segment in enumerate(segments, 1):
        lines.append(str(i))
        lines.append(f"{seconds_to_srt(segment.start)} --> {seconds_to_srt(segment.end)}")
        lines.append(segment.translated or segment.text)
        lines.append("")
    srt_path.write_text("\n".join(lines), encoding="utf-8")


def seconds_to_srt(value: float) -> str:
    # Round once to whole milliseconds so a fraction such as 0.9996 carries
    # into the seconds field instead of rendering as ",1000".
    total_ms = int(round(max(0.0, value) * 1000))
    total_seconds, ms = divmod(total_ms, 1000)
    h, remainder = divmod(total_seconds, 3600)
    m, s = divmod(remainder, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def parse_args(argv: Iterable[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Dub video/audio through CoderAI API while preserving music and effects.",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument("input", type=Path, help="Input video or audio file")
    parser.add_argument("-o", "--output", type=Path, help="Output media path")
    parser.add_argument("-l", "--target-language", required=True, help="Target dubbing language, e.g. Italian, Spanish, ja")
    parser.add_argument("--source-language", help="Optional source language hint for transcription")
    parser.add_argument("--base-url", default=DEFAULT_BASE_URL, help="CoderAI API base URL")
    parser.add_argument("--api-key", default=DEFAULT_API_KEY, help="CoderAI bearer token; defaults to CODERAI_API_KEY")
    parser.add_argument("--audio-model", help="CoderAI transcription model id")
    parser.add_argument("--text-model", default=env_default("text", "MODEL"), help="CoderAI text model id for translation and dialogue analysis")
    parser.add_argument("--transcribe-base-url", default=env_default("transcribe", "BASE_URL"), help="Override CoderAI URL for /v1/audio/transcriptions")
    parser.add_argument("--transcribe-api-key", default=env_default("transcribe", "API_KEY"), help="Override bearer token for transcription")
    parser.add_argument("--transcribe-model", default=env_default("transcribe", "MODEL"), help="Alias for --audio-model; defaults to CODERAI_TRANSCRIBE_MODEL")
    parser.add_argument("--text-base-url", default=env_default("text", "BASE_URL"), help="Override CoderAI URL for /v1/chat/completions")
    parser.add_argument("--text-api-key", default=env_default("text", "API_KEY"), help="Override bearer token for text requests")
    parser.add_argument("--voice-base-url", default=env_default("voice", "BASE_URL"), help="Override CoderAI URL for /v1/audio/voices and /v1/audio/clone")
    parser.add_argument("--voice-api-key", default=env_default("voice", "API_KEY"), help="Override bearer token for voice cloning/profile requests")
    parser.add_argument("--convert-base-url", default=env_default("convert", "BASE_URL"), help="Override CoderAI URL for /v1/audio/convert")
    parser.add_argument("--convert-api-key", default=env_default("convert", "API_KEY"), help="Override bearer token for voice conversion")
    parser.add_argument("--stems-base-url", default=env_default("stems", "BASE_URL"), help="Override CoderAI URL for /v1/audio/stems")
    parser.add_argument("--stems-api-key", default=env_default("stems", "API_KEY"), help="Override bearer token for stem separation")
    parser.add_argument("--max-speakers", type=int, default=8, help="Maximum recurring speaker voices to infer")
    parser.add_argument("--voice-prefix", default="dub", help="Prefix for saved CoderAI voice profiles")
    parser.add_argument("--no-stems", action="store_true", help="Do not call /v1/audio/stems; use local ducking to preserve background")
    parser.add_argument("--stem-fallback", action="store_true", help="Ask CoderAI stems endpoint to use its ffmpeg fallback mode")
    parser.add_argument("--no-metric-fit", action="store_true", help="Skip second LLM pass for tighter metric/lip-sync adaptation")
    parser.add_argument("--no-singing-convert", action="store_true", help="Do not use /v1/audio/convert for singing segments")
    parser.add_argument("--duck-db", type=float, default=-14.0, help="Dialogue-region background ducking target in dB when stems are disabled/unavailable")
    parser.add_argument("--max-stretch", type=float, default=1.35, help="Maximum local time stretch/compress factor for generated lines")
    parser.add_argument("--keep-workdir", type=Path, help="Keep intermediate files in this directory")
    return parser.parse_args(list(argv))


def main(argv: Iterable[str]) -> int:
    args = parse_args(argv)
    require_binary("ffmpeg")
    require_binary("ffprobe")
    input_path = args.input.expanduser().resolve()
    if not input_path.exists():
        raise SystemExit(f"Input file not found: {input_path}")
    video_input = is_video(input_path)
    output_path = args.output
    if output_path is None:
        suffix = ".mp4" if video_input else ".m4a"
        output_path = input_path.with_name(f"{input_path.stem}.dubbed.{sanitize_name(args.target_language)}{suffix}")
    output_path = output_path.expanduser().resolve()
    clients = build_clients(args)
    transcribe_models = clients.transcribe.list_models()
    text_models = clients.text.list_models()
    audio_model = args.audio_model or args.transcribe_model or choose_default_model(transcribe_models, "audio_transcription") or "whisper"
    text_model = args.text_model or choose_default_model(text_models, "text_generation")
    if not text_model:
        raise SystemExit("No text model found. Pass --text-model with a CoderAI chat model id.")
    work_context = tempfile.TemporaryDirectory(prefix="coderai-dub-") if args.keep_workdir is None else None
    workdir = args.keep_workdir or Path(work_context.name)  # type: ignore[union-attr]
    workdir.mkdir(parents=True, exist_ok=True)
    try:
        source_audio = workdir / "source.wav"
        extract_audio(input_path, source_audio)
        total_duration = media_duration(source_audio)
        print(f"transcribing with {audio_model} via {client_label(clients.transcribe)}")
        segments = clients.transcribe.transcribe(source_audio, audio_model, args.source_language)
        segments = [s for s in segments if s.text.strip() and s.duration >= 0.08]
        if not segments:
            raise RuntimeError("No dialogue segments found in the input")
        print(f"assigning speakers and singing flags with {text_model} via {client_label(clients.text)}")
        assign_speakers(clients.text, text_model, segments, args.max_speakers)
        print(f"translating {len(segments)} segments to {args.target_language} via {client_label(clients.text)}")
        translate_segments(clients.text, text_model, args.target_language, segments)
        if not args.no_metric_fit:
            print("fitting translated lines to segment metrics")
            fit_translation_metric(clients.text, text_model, args.target_language, segments)
        run_prefix = sanitize_name(f"{args.voice_prefix}_{input_path.stem}_{int(time.time())}")
        print(f"creating voice profiles via {client_label(clients.voice)}")
        create_voice_profiles(clients.voice, source_audio, segments, workdir, run_prefix)
        generate_segment_audio(
            clients.voice,
            clients.convert,
            source_audio,
            segments,
            workdir,
            args.max_stretch,
            preserve_singing=not args.no_singing_convert,
        )
        dub_track = workdir / "dub_track.wav"
        build_dub_track(segments, total_duration, dub_track, workdir)
        background = workdir / "background.wav"
        stems = None
        if not args.no_stems:
            print(f"requesting CoderAI stem separation via {client_label(clients.stems)}")
            try:
                stems = clients.stems.separate_stems(source_audio, workdir, args.stem_fallback)
            except Exception as exc:
                print(f"warning: stems unavailable, using local dialogue ducking: {exc}", file=sys.stderr)
        if stems:
            _, instrumental = stems
            run(["ffmpeg", "-y", "-i", str(instrumental), "-t", f"{total_duration:.3f}", "-c:a", "pcm_s16le", str(background)])
        else:
            duck_background(source_audio, segments, background, workdir, args.duck_db)
        final_audio = workdir / "final_audio.m4a"
        mix_audio(background, dub_track, final_audio, total_duration)
        mux_output(input_path, final_audio, output_path, video_input)
        write_artifacts(segments, output_path)
        print(f"wrote {output_path}")
        print(f"wrote {output_path.with_suffix('.segments.json')}")
        print(f"wrote {output_path.with_suffix('.translated.srt')}")
        if args.keep_workdir:
            print(f"kept workdir {workdir}")
        return 0
    finally:
        if work_context is not None:
            work_context.cleanup()


if __name__ == "__main__":
    try:
        raise SystemExit(main(sys.argv[1:]))
    except KeyboardInterrupt:
        raise SystemExit(130)