Stefy Lanza (nextime / spora) authored
Cache the Wan transformer expert(s) between consecutive trainings against the same base (keyed by base_path+quantize), so a back-to-back job skips the very slow reload (tens of minutes for A14B). At teardown only this job's adapter and gradient-checkpoint hooks are removed; the base transformer(s) stay resident.

Since 4-bit weights can't move to CPU, they hold GPU VRAM between jobs. The external VRAM releaser therefore now drops the Wan cache too when a generation needs the GPU, and the error path clears both caches.

Also report training progress every step (a cheap dict update) instead of every 10 steps, so the web UI bar advances smoothly once steps begin.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
d1fc17e0
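
The caching scheme described in the message can be sketched roughly as below. This is an illustration, not the code from this commit: the names `_wan_cache`, `get_base`, `release_wan_cache`, and `run_job`, as well as the `loader` and `train` callables, are hypothetical, and evicting the old entry when the key changes is an assumption the message does not state.

```python
from typing import Any, Callable, Dict, Optional, Tuple

CacheKey = Tuple[str, Optional[str]]  # (base_path, quantize)

# Hypothetical module-level cache; the real code may store it elsewhere.
_wan_cache: Dict[CacheKey, Any] = {}


def get_base(base_path: str, quantize: Optional[str],
             loader: Callable[[str, Optional[str]], Any]) -> Any:
    """Return the cached base for (base_path, quantize), loading on a miss."""
    key = (base_path, quantize)
    if key not in _wan_cache:
        # Assumption: a different base or quantization evicts the old
        # entry first, so two large models never sit in VRAM together.
        _wan_cache.clear()
        _wan_cache[key] = loader(base_path, quantize)
    return _wan_cache[key]


def release_wan_cache() -> int:
    """Drop every cached base; return how many entries were released."""
    released = len(_wan_cache)
    _wan_cache.clear()
    return released


def run_job(base_path: str, quantize: Optional[str],
            loader: Callable[[str, Optional[str]], Any],
            train: Callable[[Any], Any]) -> Any:
    """Train against the cached base; clear the cache if the job fails."""
    try:
        base = get_base(base_path, quantize, loader)
        return train(base)
    except Exception:
        # Error path: do not keep a possibly broken model resident.
        release_wan_cache()
        raise
```

In this sketch an external VRAM releaser would call `release_wan_cache()` before a generation needs the GPU. Real code would presumably also have to free the CUDA allocator's cached memory after dropping the references, which is left out here.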