    Fix: In ondemand mode, fully unload current model before loading new one · 7d838962
    Your Name authored
    - In ondemand mode (neither --load-all nor --loadswap specified), when a new model
      is requested, the current model in VRAM is now fully unloaded before the new one
      is loaded, ensuring clean model switching.
    - Added cleanup logic to both the /v1/chat/completions and /v1/completions endpoints
    - Added the same logic to the image generation endpoints (diffusers and sd.cpp paths)
    - Cleanup consists of model teardown, gc.collect(), and torch.cuda.empty_cache()
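    The cleanup sequence above can be sketched as a small helper. This is a hypothetical
    illustration, not the project's actual code: the `state` dict, its `"model"` key, and
    the optional `cleanup()` hook on the model object are all assumptions; only the
    gc.collect() / torch.cuda.empty_cache() steps come from the commit message.

    ```python
    import gc

    try:
        import torch  # optional: only needed for the CUDA cache-clearing step
    except ImportError:
        torch = None

    def unload_current_model(state):
        """Fully release the currently loaded model before loading a new one.

        Hypothetical sketch of the ondemand-mode cleanup: drop all references to
        the model, force a garbage-collection pass, then release cached CUDA
        memory back to the driver so the next model starts from a clean VRAM state.
        """
        model = state.pop("model", None)
        if model is not None:
            # If the model object exposes its own teardown hook, call it first
            # (assumed interface; real frameworks differ).
            cleanup = getattr(model, "cleanup", None)
            if callable(cleanup):
                cleanup()
            del model
        # Reclaim Python-side objects, then empty the CUDA caching allocator.
        gc.collect()
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()
        return state
    ```

    Note that `torch.cuda.empty_cache()` only releases memory the caching allocator
    is holding; the gc pass beforehand matters because cached tensors still referenced
    from Python cannot be freed.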
Repository files:
    .vscode
    codai
    .gitignore
    LICENSE.md
    README.md
    build.sh
    coder
    coderai
    requirements-nvidia.txt
    requirements-vulkan.txt
    requirements.txt