Commit `7d838962`

- In ondemand mode (no `--load-all` or `--loadswap` specified), when a new model is requested, the current model in VRAM is now fully unloaded before the new one is loaded, ensuring clean model switching.
- Added cleanup logic to both the `/v1/chat/completions` and `/v1/completions` endpoints.
- Added the same logic to the image generation endpoints (diffusers and sd.cpp paths).
- Cleanup includes model cleanup, `gc.collect()`, and `torch.cuda.empty_cache()`.
Repository contents:

| Name |
|---|
| .vscode |
| codai |
| .gitignore |
| LICENSE.md |
| README.md |
| build.sh |
| coder |
| coderai |
| requirements-nvidia.txt |
| requirements-vulkan.txt |
| requirements.txt |