    whisper: account a running runner as a loaded model for VRAM eviction · 2a214215
    Stefy Lanza (nextime / spora ) authored
    Starting a whisper-server runner loads the gguf onto the GPU, but it was
    invisible to the VRAM-eviction logic — it never evicted others to make room,
    recorded no footprint, and (lacking a cleanup()) couldn't itself be evicted.
    
    - WhisperServerManager.cleanup() -> stop(), so _evict_one/unload_model can
      free its VRAM like any other model.
    - MultiModelManager.start_whisper_server(): estimate the gguf footprint, evict
      other models if free VRAM is short, start the subprocess, and register it in
      models/models_in_vram/_measured_vram_gb (active_in_vram). It's now both a
      trigger for eviction and an eviction candidate.
    - stop_whisper_server(): stop + clear all that accounting (frees VRAM).
    - Routed every start/stop through these: on-request transcription, engine
      startup pre-load, admin model-load (Load button) and model-unload/disable.
    
    So: starting a runner = a model load (evicts as needed); unloading = frees VRAM.
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
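The accounting flow described in the commit message can be sketched as a toy simulation. This is only an illustration under stated assumptions, not the repository's code: `ToyManager` and `FakeRunner` are hypothetical stand-ins for `MultiModelManager` and `WhisperServerManager`, no subprocess or GPU is involved, the `overhead` factor used to estimate the footprint is invented, and oldest-first eviction is an assumed policy.

```python
class FakeRunner:
    """Hypothetical stand-in for WhisperServerManager; starts no subprocess."""

    def __init__(self, gguf_size_gb: float):
        self.gguf_size_gb = gguf_size_gb
        self.running = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False

    def cleanup(self) -> None:
        # As in the commit: cleanup() delegates to stop(), so generic
        # eviction code can free the runner like any other model.
        self.stop()


class ToyManager:
    """Hypothetical stand-in for MultiModelManager's VRAM bookkeeping."""

    def __init__(self, total_vram_gb: float):
        self.total_vram_gb = total_vram_gb
        self.models = {}             # key -> object exposing cleanup()
        self.models_in_vram = []     # keys in load order, oldest first
        self._measured_vram_gb = {}  # key -> recorded footprint in GB

    def free_vram_gb(self) -> float:
        return self.total_vram_gb - sum(self._measured_vram_gb.values())

    def register(self, key: str, model, footprint_gb: float) -> None:
        self.models[key] = model
        self.models_in_vram.append(key)
        self._measured_vram_gb[key] = footprint_gb

    def unload_model(self, key: str) -> None:
        # Stop the model and clear all of its accounting.
        self.models.pop(key).cleanup()
        self.models_in_vram.remove(key)
        self._measured_vram_gb.pop(key, None)

    def _evict_one(self) -> bool:
        # Assumed policy: evict the oldest loaded model first.
        if not self.models_in_vram:
            return False
        self.unload_model(self.models_in_vram[0])
        return True

    def start_whisper_server(self, key: str, runner: FakeRunner,
                             overhead: float = 1.25) -> None:
        # Estimate the footprint from the gguf size (overhead is an assumption).
        needed_gb = runner.gguf_size_gb * overhead
        # Evict other models while free VRAM is short.
        while self.free_vram_gb() < needed_gb:
            if not self._evict_one():
                raise RuntimeError("not enough VRAM even after evicting everything")
        runner.start()
        # Registering the runner makes it an eviction candidate too.
        self.register(key, runner, needed_gb)

    def stop_whisper_server(self, key: str) -> None:
        if key in self.models:
            self.unload_model(key)


if __name__ == "__main__":
    mgr = ToyManager(total_vram_gb=8.0)
    llm = FakeRunner(5.0)
    llm.start()
    mgr.register("llm", llm, 5.0)
    mgr.start_whisper_server("whisper", FakeRunner(3.0))
    print(mgr.models_in_vram, mgr.free_vram_gb())
```

In this sketch the runner is registered through the same bookkeeping as every other model, which is what lets one `unload_model` path serve both eviction and an explicit stop.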
Name
admin
api
backends
broker
frontproxy
models
openai
pydantic
queue
tasks
__init__.py
cli.py
config.py
main.py
platform_paths.py