    Fix: In ondemand mode, fully unload current model before loading new one · 7d838962
    Your Name authored
    - In ondemand mode (no --load-all or --loadswap specified), when a new model
      is requested, the current model in VRAM is now fully unloaded before loading
      the new one. This ensures clean model switching.
    - Added cleanup logic to both /v1/chat/completions and /v1/completions endpoints
    - Added same logic to image generation endpoints (diffusers and sd.cpp paths)
    - Cleanup includes: model cleanup, gc.collect(), torch.cuda.empty_cache()
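The cleanup steps listed above can be sketched as a small helper. Only the three steps named in the commit (model cleanup, `gc.collect()`, `torch.cuda.empty_cache()`) come from the source; the `ServerState` container and `unload_current_model` name are hypothetical, and the `torch` import is guarded so the sketch also runs on CPU-only machines.

```python
import gc

try:
    import torch  # optional: only needed for the CUDA cache cleanup step
except ImportError:  # CPU-only environments
    torch = None


class ServerState:
    """Hypothetical holder for the single resident model in ondemand mode."""

    def __init__(self):
        self.model = None


def unload_current_model(state: ServerState) -> None:
    """Fully release the resident model before loading a new one.

    Mirrors the cleanup described in the commit: drop the model,
    collect garbage, then empty the CUDA allocator cache.
    """
    if state.model is not None:
        # Drop the only strong reference so host memory can be reclaimed.
        state.model = None
    # Force a collection pass to break any lingering reference cycles.
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        # Return cached CUDA allocations to the driver, freeing VRAM
        # for the incoming model.
        torch.cuda.empty_cache()
```

A sketch like this would be called at the top of each serving endpoint (chat, completions, and the image-generation paths) whenever the requested model differs from the one currently loaded.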
Repository contents:
    api/
    backends/
    models/
    pydantic/
    queue/
    __init__.py
    cli.py
    main.py