Your Name authored
- Default mode changed to ondemand (pre-load first model, unload/load on switch)
- loadswap: load first model in VRAM, others in CPU RAM, swap on switch
- loadall: try to load all models in VRAM, offload to CPU RAM if OOM
- --nopreload: skip pre-loading in any mode, load on first request
- request_model() now properly handles all three modes
- Added _move_model_to_cpu() and _move_model_to_vram() for loadswap
- Fixed NameError: model_manager reference in request_model() (was using global singleton instead of self)
- Updated CLI help text for --loadall, --loadswap, --nopreload
c08a5b4f
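
The three loading modes named in the commit message can be illustrated with a small simulation. This is only a sketch, not the code in manager.py: the class `SketchModelManager`, the `LoadMode` enum, the `location` map, and the `vram_slots` parameter (a stand-in for a real out-of-memory check) are all hypothetical. Only the names `request_model`, `_move_model_to_cpu`, and `_move_model_to_vram` come from the commit message, and their bodies here are assumptions that track where a model lives instead of moving real weights.

```python
from enum import Enum


class LoadMode(Enum):
    ONDEMAND = "ondemand"   # default: one model in VRAM, unload/load on switch
    LOADSWAP = "loadswap"   # first model in VRAM, others parked in CPU RAM
    LOADALL = "loadall"     # everything in VRAM, spill to CPU RAM when full


class SketchModelManager:
    """Hypothetical stand-in that records model placement only."""

    def __init__(self, names, mode=LoadMode.ONDEMAND, preload=True, vram_slots=1):
        self.names = list(names)
        self.mode = mode
        # Assumption: a slot count replaces a real OOM check.
        self.capacity = vram_slots if mode is LoadMode.LOADALL else 1
        self.location = {name: "unloaded" for name in self.names}
        self.active = None
        if preload and self.names:  # preload=False mimics --nopreload
            self._preload()

    def _in_vram(self):
        return sum(1 for loc in self.location.values() if loc == "vram")

    def _preload(self):
        if self.mode is LoadMode.LOADALL:
            for name in self.names:
                fits = self._in_vram() < self.capacity
                self.location[name] = "vram" if fits else "cpu"
        else:
            self.location[self.names[0]] = "vram"
            if self.mode is LoadMode.LOADSWAP:
                for name in self.names[1:]:
                    self.location[name] = "cpu"
        self.active = self.names[0]

    def _move_model_to_cpu(self, name):
        self.location[name] = "cpu"

    def _move_model_to_vram(self, name):
        self.location[name] = "vram"

    def request_model(self, name):
        if name not in self.location:
            raise KeyError(f"unknown model: {name}")
        if self.location[name] != "vram":
            # Make room first if VRAM is full, by evicting the active model.
            if self.active is not None and self._in_vram() >= self.capacity:
                if self.mode is LoadMode.ONDEMAND:
                    self.location[self.active] = "unloaded"
                else:
                    self._move_model_to_cpu(self.active)
            self._move_model_to_vram(name)
        self.active = name
        return name


if __name__ == "__main__":
    manager = SketchModelManager(["a", "b"], mode=LoadMode.LOADSWAP)
    manager.request_model("b")
    print(manager.location)
```

Evicting the currently active model when VRAM is full is a design choice of this sketch; the commit message does not say which model is offloaded in loadall mode.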
| Name | Last commit | Last update |
|---|---|---|
| cache | ||
| __init__.py | ||
| capabilities.py | ||
| grammar.py | ||
| manager.py | ||
| parser.py | ||
| templates.py | ||
| tool_call_grammar.gbnf | ||
| utils.py |