codai/models/acceleration.py · 99f8ba859fd0aed703d506de9c9fe865cacb90a4 · nexlab / coderai

VRAM estimation (manager.py):
- Weight the effective quant multiplier by REAL per-component parameter
  shares (new _component_param_shares scans safetensors by component
  folder) instead of a blind 70/30 split. Wan2.2 is 99.6% quantizable
  (two 14B experts + text encoder 4-bit, only the 0.13B VAE dense), so
  the old 0.475x multiplier inflated ~25.8 GB -> 42.7 GB and forced
  needless offload. Now ~0.28x -> ~25.8 GB. VAE forced dense (conv-only,
  bnb can't quantize).

Auto offload decision (video.py):
- 'auto': when peak footprint exceeds free VRAM, go straight to `model`
  CPU offload (active component on GPU, near full-GPU speed): no
  full-GPU gamble, no slow balanced+disk path.
- 'auto-borderline' (new mode): same, except a marginal overshoot
  (<=3 GB) tries full-GPU first to keep both experts resident and use
  free VRAM, falling back to model offload on OOM.

Acceleration LoRA (acceleration.py + video.py):
- Keep the distill/Lightning LoRA as an ACTIVE RUNTIME ADAPTER instead
  of fusing. Fusing into CPU-offloaded bitsandbytes 4-bit weights
  triggers a dequant->merge->requant per Linear on the CPU, taking
  minutes/hours per expert and appearing to hang (high CPU, empty VRAM).
  Runtime adapters apply at forward time on the GPU at negligible cost
  and natively cover transformer_2.
- _sync_video_loras preserves the accel adapters across per-request LoRA
  swaps and re-includes them in every set_adapters; _unload_video_loras
  deletes only per-request adapters, keeping accel.

UI (models.html):
- Add "Auto borderline-aware" offload strategy option + updated hint.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
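The share-weighted multiplier and the 'auto' / 'auto-borderline' decision described in the commit message can be sketched roughly as follows. This is an illustrative reconstruction, not the code in manager.py or video.py: the function names, the returned strategy labels, the flat 0.25 factor for 4-bit weights, and the 5.7B text encoder size are all assumptions made for the example. It also assumes that a footprint which fits in free VRAM simply runs on the GPU, which the commit message does not state explicitly.

```python
# Parameter counts in billions. The expert and VAE sizes come from the commit
# message; the text encoder size is an ASSUMED illustrative value.
PARAMS_B = {
    "transformer": 14.0,
    "transformer_2": 14.0,
    "text_encoder": 5.7,  # assumption, not stated in the commit
    "vae": 0.13,
}
DENSE_COMPONENTS = {"vae"}  # conv-only, kept unquantized
QUANT_FACTOR = 0.25  # ASSUMED: 4-bit vs 16-bit size ratio, overhead ignored


def effective_multiplier(params, dense, quant_factor=QUANT_FACTOR):
    """Weight the per-component size factor by each component's parameter share."""
    total = sum(params.values())
    multiplier = 0.0
    for name, count in params.items():
        share = count / total
        multiplier += share * (1.0 if name in dense else quant_factor)
    return multiplier


def blind_multiplier(quant_share=0.70, quant_factor=QUANT_FACTOR):
    """Fixed split: quant_share of the weights quantized, the rest dense."""
    return quant_share * quant_factor + (1.0 - quant_share) * 1.0


def choose_offload(peak_gb, free_gb, mode="auto", borderline_gb=3.0):
    """Pick a (hypothetically named) strategy from peak footprint and free VRAM."""
    if peak_gb <= free_gb:
        return "full_gpu"
    overshoot = peak_gb - free_gb
    if mode == "auto-borderline" and overshoot <= borderline_gb:
        # Marginal overshoot: try full GPU first, fall back to model offload on OOM.
        return "full_gpu_then_model_offload"
    return "model_offload"


if __name__ == "__main__":
    print(f"blind split:    {blind_multiplier():.3f}x")
    print(f"share-weighted: {effective_multiplier(PARAMS_B, DENSE_COMPONENTS):.3f}x")
    # 24 GB of free VRAM is a hypothetical figure for the example.
    print(choose_offload(25.8, 24.0, mode="auto"))
    print(choose_offload(25.8, 24.0, mode="auto-borderline"))
```

Under these assumptions the blind 70/30 split gives 0.475x, matching the old multiplier quoted above. The share-weighted value comes out near 0.25x, a little below the ~0.28x in the commit message, presumably because the real estimate accounts for overhead that this sketch leaves out.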

acceleration.py 15.5 KB
