In ondemand mode (no `--load-all` or `--loadswap` specified), when a new model is requested, the current model in VRAM is now fully unloaded before loading the new one. This ensures clean model switching.

- Added cleanup logic to both `/v1/chat/completions` and `/v1/completions` endpoints
- Added the same logic to the image generation endpoints (diffusers and sd.cpp paths)
- Cleanup includes: model cleanup, `gc.collect()`, `torch.cuda.empty_cache()`
7d838962
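The unload-before-load step described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the `state` object and its `model` attribute are hypothetical stand-ins for whatever handle the server keeps (presumably in `state.py`), and the torch import is guarded so the sketch also runs on CPU-only machines.

```python
import gc


def unload_current_model(state):
    """Fully unload the resident model before loading a new one (ondemand mode).

    `state.model` is a hypothetical attribute name used for illustration;
    the real server's state layout may differ.
    """
    model = getattr(state, "model", None)
    if model is None:
        return False  # nothing loaded, nothing to do
    # Drop all references so Python can reclaim the object.
    state.model = None
    del model
    # Force a collection pass to break any lingering reference cycles.
    gc.collect()
    # Release cached CUDA blocks back to the driver, if torch is available.
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass
    return True
```

Calling `gc.collect()` before `torch.cuda.empty_cache()` matters: `empty_cache()` only returns allocator blocks that are already free, so the model tensors must be garbage-collected first.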
| Name | Last commit | Last update |
|---|---|---|
| __init__.py | | |
| app.py | | |
| images.py | | |
| log.py | | |
| state.py | | |
| text.py | | |
| transcriptions.py | | |
| tts.py | | |