    Fix: In ondemand mode, fully unload current model before loading new one · 7d838962
    Your Name authored
    - In ondemand mode (neither --load-all nor --loadswap specified), when a new model
      is requested, the model currently in VRAM is now fully unloaded before the new
      one is loaded, so the old model is not left resident while the new one loads.
    - Added cleanup logic to both the /v1/chat/completions and /v1/completions endpoints
    - Added the same logic to the image generation endpoints (diffusers and sd.cpp paths)
    - Cleanup consists of: model cleanup, gc.collect(), torch.cuda.empty_cache()
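
The unload-before-load sequence described above could look roughly like the sketch below. It is not the code from this commit: `ModelSlot`, `ensure_loaded`, `unload`, and the optional `cleanup()` hook on the model are hypothetical names chosen for illustration, and `torch` is treated as optional so the sketch also runs on a machine without it. Only `gc.collect()`, `torch.cuda.is_available()`, and `torch.cuda.empty_cache()` are real library calls.

```python
import gc

try:
    import torch  # optional: the sketch still runs when torch is not installed
except ImportError:
    torch = None


class ModelSlot:
    """Hypothetical holder for the single model kept resident in ondemand mode."""

    def __init__(self):
        self.name = None
        self.model = None

    def unload(self):
        """Release the current model, then ask the allocator to free cached VRAM."""
        if self.model is None:
            return
        # Assumption: a model may expose an optional cleanup() hook.
        cleanup = getattr(self.model, "cleanup", None)
        if callable(cleanup):
            cleanup()
        # Drop our reference so the garbage collector can reclaim the object.
        self.model = None
        self.name = None
        gc.collect()
        # empty_cache() only releases cached blocks that are no longer in use,
        # which is why the reference is dropped and collected first.
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()

    def ensure_loaded(self, name, loader):
        """Return the model for `name`, unloading a different model first."""
        if self.name == name and self.model is not None:
            return self.model
        self.unload()
        self.model = loader(name)
        self.name = name
        return self.model
```

In this sketch each endpoint handler would call `ensure_loaded()` with the requested model name before serving the request. Any other reference to the old model (for example a pipeline object kept in a module-level variable) would also have to be dropped, otherwise `gc.collect()` cannot reclaim it and the VRAM stays allocated.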
images.py 44.2 KB