    video: param-weighted VRAM estimate, smarter auto offload, runtime accel LoRA · eeb3bba1
    Stefy Lanza (nextime / spora ) authored
    VRAM estimation (manager.py):
    - Weight the effective quant multiplier by REAL per-component parameter
      shares (new _component_param_shares scans safetensors by component folder)
      instead of a blind 70/30 split. Wan2.2 is 99.6% quantizable (two 14B
      experts + text encoder 4-bit, only the 0.13B VAE dense), so the old 0.475x
      multiplier inflated ~25.8 GB -> 42.7 GB and forced needless offload. Now
      ~0.28x -> ~25.8 GB. VAE forced dense (conv-only, bnb can't quantize).
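
The weighting described above can be sketched as follows. This is a minimal illustration, not the code in manager.py: the function name, the component labels, and the flat 0.25 factor for 4-bit weights are assumptions made for the example.

```python
# Sketch only: names and the 0.25 factor are illustrative assumptions,
# not taken from manager.py.
def effective_multiplier(shares, dense, quant_factor=0.25):
    """Param-weighted size multiplier relative to a fully dense checkpoint.

    shares: component name -> share of total parameters
    dense:  names of components that must stay unquantized
    """
    total = sum(shares.values())
    return sum(
        (1.0 if name in dense else quant_factor) * share / total
        for name, share in shares.items()
    )


# The old blind split: 70% quantizable, 30% dense.
blind = effective_multiplier({"quantizable": 0.70, "dense": 0.30}, dense={"dense"})

# Shares shaped like the Wan2.2 case: 99.6% quantizable, VAE kept dense.
weighted = effective_multiplier({"quantizable": 0.996, "vae": 0.004}, dense={"vae"})
```

With these inputs the blind split evaluates to 0.475, matching the old multiplier. The weighted case comes out near 0.25 rather than the ~0.28x quoted above, so the real estimator presumably accounts for overhead that this sketch leaves out.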
    
    Auto offload decision (video.py):
    - 'auto': when peak footprint exceeds free VRAM, go straight to `model` CPU
      offload (active component on GPU, near full-GPU speed) — no full-GPU gamble,
      no slow balanced+disk path.
    - 'auto-borderline' (new mode): same, except a marginal overshoot (<=3 GB)
      tries full-GPU first to keep both experts resident and use free VRAM,
      falling back to model offload on OOM.
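
The two modes can be read as a small decision function. The sketch below is an assumption-laden illustration, not the logic in video.py: the function name, the "full_gpu" label, and the behaviour when the model fits in free VRAM are hypothetical.

```python
# Sketch only: choose_offload and the "full_gpu" label are hypothetical.
def choose_offload(mode, peak_gb, free_gb, borderline_gb=3.0):
    """Return the strategies to try, in order, for the given mode."""
    overshoot = peak_gb - free_gb
    if overshoot <= 0:
        # Assumed: when the peak footprint fits, run fully on the GPU.
        return ["full_gpu"]
    if mode == "auto-borderline" and overshoot <= borderline_gb:
        # Marginal overshoot: try full GPU first, fall back to `model` on OOM.
        return ["full_gpu", "model"]
    # Otherwise go straight to `model` CPU offload.
    return ["model"]
```

For example, with a 25.8 GB peak and 24 GB free, 'auto' yields only `model`, while 'auto-borderline' tries full GPU first and keeps `model` as the fallback.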
    
    Acceleration LoRA (acceleration.py + video.py):
    - Keep the distill/Lightning LoRA as an ACTIVE RUNTIME ADAPTER instead of
      fusing. Fusing into CPU-offloaded bitsandbytes 4-bit weights triggers a
      dequant->merge->requant per Linear on the CPU, which takes minutes to
      hours per expert and looks like a hang (high CPU, empty VRAM). Runtime
      adapters apply at forward time on the GPU at negligible cost and natively
      cover transformer_2.
    - _sync_video_loras preserves the accel adapters across per-request LoRA swaps
      and re-includes them in every set_adapters; _unload_video_loras deletes only
      per-request adapters, keeping accel.
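
The adapter bookkeeping above can be illustrated with plain name lists. This is a hypothetical sketch, not the code of _sync_video_loras or _unload_video_loras: the helper names and the adapter names in the example are invented.

```python
# Sketch only: plan_sync / plan_unload are hypothetical helpers that model
# the bookkeeping with adapter names; no pipeline calls are made here.
def plan_sync(loaded, accel, requested):
    """Plan a per-request LoRA swap that never drops acceleration adapters.

    Returns (to_delete, to_activate): stale per-request adapters to remove,
    and the full list to pass to set_adapters (accel adapters always included).
    """
    keep = set(accel) | set(requested)
    to_delete = sorted(name for name in loaded if name not in keep)
    to_activate = sorted(accel) + sorted(n for n in requested if n not in accel)
    return to_delete, to_activate


def plan_unload(loaded, accel):
    """Delete only per-request adapters; acceleration adapters stay loaded."""
    return sorted(name for name in loaded if name not in set(accel))
```

The point of the design is that the acceleration adapter appears in every activation list and in no deletion list, so a per-request swap cannot silently disable it.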
    
    UI (models.html):
    - Add "Auto borderline-aware" offload strategy option + updated hint.
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
acceleration.py 15.5 KB