Stefy Lanza (nextime / spora) authored
Starting a whisper-server runner loads the gguf onto the GPU, but it was invisible to the VRAM-eviction logic: it never evicted others to make room, recorded no footprint, and (lacking a cleanup()) couldn't itself be evicted.

- WhisperServerManager.cleanup() -> stop(), so _evict_one/unload_model can free its VRAM like any other model.
- MultiModelManager.start_whisper_server(): estimate the gguf footprint, evict other models if free VRAM is short, start the subprocess, and register it in models/models_in_vram/_measured_vram_gb (active_in_vram). It's now both a trigger for eviction and an eviction candidate.
- stop_whisper_server(): stop + clear all that accounting (frees VRAM).
- Routed every start/stop through these: on-request transcription, engine startup pre-load, admin model-load (Load button) and model-unload/disable.

So: starting a runner = a model load (evicts as needed); unloading = frees VRAM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2a214215
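
The accounting described in the commit message can be illustrated with a small, self-contained sketch. It is not the project's code: FakeWhisperServer, SketchModelManager, free_vram_gb(), the fixed total_vram_gb budget, the 1.25 overhead factor, and the oldest-first eviction order are all assumptions made for illustration. Only the names cleanup(), stop(), _evict_one, start_whisper_server, stop_whisper_server, models, models_in_vram and _measured_vram_gb are taken from the message above.

```python
from collections import OrderedDict


class FakeWhisperServer:
    """Hypothetical stand-in for WhisperServerManager; starts no real subprocess."""

    def __init__(self, name, gguf_size_gb):
        self.name = name
        self.gguf_size_gb = gguf_size_gb
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def cleanup(self):
        # As in the commit: cleanup() just delegates to stop(),
        # which is what makes the runner evictable.
        self.stop()


class SketchModelManager:
    """Toy VRAM bookkeeping; the budget and overhead factor are assumptions."""

    def __init__(self, total_vram_gb):
        self.total_vram_gb = total_vram_gb
        self.models = {}
        self.models_in_vram = OrderedDict()  # oldest entry is evicted first (assumed policy)
        self._measured_vram_gb = {}

    def free_vram_gb(self):
        # Free VRAM is derived from recorded footprints, not queried from a GPU.
        return self.total_vram_gb - sum(self._measured_vram_gb.values())

    def _evict_one(self):
        # Drop the oldest resident model and clear its accounting.
        name, model = self.models_in_vram.popitem(last=False)
        model.cleanup()
        self.models.pop(name, None)
        self._measured_vram_gb.pop(name, None)
        return name

    def start_whisper_server(self, server, overhead=1.25):
        # Estimate the footprint from the gguf size (assumed heuristic).
        needed_gb = server.gguf_size_gb * overhead
        # Evict other models while free VRAM is short.
        while self.free_vram_gb() < needed_gb and self.models_in_vram:
            self._evict_one()
        server.start()
        # Register the runner so it is itself an eviction candidate.
        self.models[server.name] = server
        self.models_in_vram[server.name] = server
        self._measured_vram_gb[server.name] = needed_gb

    def stop_whisper_server(self, name):
        # Stop the runner and clear every trace of its footprint.
        server = self.models_in_vram.pop(name, None)
        if server is not None:
            server.stop()
        self.models.pop(name, None)
        self._measured_vram_gb.pop(name, None)
```

In this sketch, starting a second runner that does not fit next to the first one evicts the first, and stopping a runner returns its recorded footprint to the free pool. The sketch does not handle a model larger than the whole budget; it simply starts it after evicting everything else.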