Exact filename or suffix pattern. Leave blank to download all .gguf files.
Will download the full repository using the HuggingFace snapshot API. This is the correct method for safetensors / non-GGUF models. Large repos may take a while.
Preparing…
———0%
Upload GGUF model
Uploading…
0%—
Edit whisper-server model
Configure model
Capabilities
Text / Language
Image generation & editing
Image analysis & processing
Video
Audio
3D
Shown on the config pill. Falls back to alias, then "Config N".
Backend
Pin this model to a specific engine/card. Overrides the default engine. Only shown when multiple engines are running.When off (default), a request fails if the pinned engine is down/busy-unreachable instead of running on a different card.
These tune ds4-server for THIS model (they vary by quant/size/context). Leave blank to inherit the global ds4 settings. Context window is set by n_ctx above.
On for big models that don't fit VRAM; off for small ones that do.
Set just above this model's resident weights (+~2 GiB). ds4 defaults to half the card, which starves the cache.
Cap the expert cache (a count or NGB), prefill-chunk, kv-disk-space, etc. Blank = inherit global.
KEY=VALUE pairs. e.g. DS4_CUDA_WEIGHT_ARENA_CHUNK_MB=512 fixes weight-arena OOM under heavy expert caching. Blank = inherit global.
⚡ Performance: put the GGUF on NVMe/SSD (≈10× prefill for a streamed MoE) ·
cap the expert cache with --ssd-streaming-cache-experts N (a count, e.g. 400) — higher helps decode but too high OOMs the weight arena ·
set reserve ≈ weights + ~2 GiB · add DS4_CUDA_WEIGHT_ARENA_CHUNK_MB=512 · avoidDS4_CUDA_WEIGHT_CACHE (slower). Decode of a model bigger than VRAM is streaming-bound (~0.5 tok/s) — the real fix is a smaller quant. Context = n_ctx above.
When on, the real measured VRAM is re-recorded every run and used for eviction even if "Used VRAM" is set above.
Parallel copies in VRAM
Inference
-1 = all on GPU (auto-offloads the overflow to CPU if it won't fit). N = N layers on GPU, the rest on CPU (slower, frees VRAM).
Pairs an mmproj GGUF with this model to add image/vision input (passed to llama.cpp as --mmproj). Auto-selected when a matching projector is detected.
Note: KV-cache quantization only applies to GGUF models on
the llama.cpp backend. HF-transformers models (incl. gemma and
other sliding-window / hybrid linear-attention architectures) ignore this and
keep an fp16 KV cache — their quantized-cache path crashes during generation.
For those, lower n_ctx to shrink the KV VRAM reserve instead.
Auto-compact context
Compacts to ~65% of the context window when the prompt reaches this %.
A separate (e.g. smaller/faster) model can write the summary while the main model answers. The dropped history is chunked to fit the chosen model's own context. Empty = the request's own model summarizes. Leave empty to inherit the global default.
By default the model's thinking is returned as a separate reasoning / reasoning_content field (think/thinking/thought tags, including the bare </think> Qwen pre-fill emits) and kept out of content. Enable this to discard it entirely. A request can override per-call with reasoning:{exclude:true}, reasoning_effort:"none", or suppress_reasoning.
bitsandbytes NF4 is the slowest 4-bit option. Quantizing to GPTQ/AWQ uses
Marlin kernels (2–4× faster) but is a heavy one-time background job; the
produced checkpoint is used automatically on the next load. Falls back to
bitsandbytes if unavailable or the architecture is unsupported.
Component Quantization (image / video pipelines)
Quantize individual pipeline components. Default uses the global 4-bit/8-bit checkboxes above.
2-bit needs optimum-quanto; GGUF file (Q5_K/Q6_K…) needs a pre-quantized .gguf path/URL.
Tip: backbone 4-bit + text encoder 8-bit + VAE None = best quality-per-GB.
Offload
Auto goes straight to model offload when the model's peak exceeds free VRAM (no full-GPU gamble). Auto borderline-aware is the same, except when the model only marginally overshoots (≤3 GB) it tries full-GPU first to keep both experts resident and use the free VRAM, falling back to model offload on OOM. model keeps only the active component on GPU (best for two-expert Wan2.2 — near full-GPU speed). group streams blocks with prefetch (lowest VRAM, fast; incompatible with bitsandbytes 4-bit → falls back to sequential). balanced fills GPU to the % below then spills.
sd.cpp Video Options
When the model won't fit entirely in VRAM, fill this % of free VRAM then spill the rest to CPU RAM. Also the cap used by auto-balanced fallback.
Lists distill-LoRA repos found in the local cache. Choose one, then pick its high/low (or single) LoRA below.
Wan2.2 A14B has two experts; the distill LoRA must be fused into both or the clip collapses to a solid colour at 4 steps. Leave blank to apply the single LoRA above to both.
A request's quantization field (e.g. turbo4) overrides the default bits. With encoding_format:"base64" the response is compact packed bytes; otherwise it returns the lossy reconstruction as floats.
Components
Tip: use <lora:name:strength> in prompts to apply at inference time
When this image model is a transformer/DiT (Z-Image, Flux, SD3) the in-process LoRA
trainer can't target it. Set a UNet-based SD1.x/SDXL model here to train identity LoRAs
against, while generation still uses this model. Leave empty to train on this model.