codai · 2a2142150a9ce0121b98c43e7116a21c21230c88 · nexlab / coderai

whisper: account a running runner as a loaded model for VRAM eviction · 2a214215

Stefy Lanza (nextime / spora) authored Jun 19, 2026

Starting a whisper-server runner loads the gguf onto the GPU, but the runner
was invisible to the VRAM-eviction logic: it never evicted other models to make
room, recorded no footprint, and (lacking a cleanup()) couldn't itself be evicted.

- WhisperServerManager.cleanup() -> stop(), so _evict_one/unload_model can
  free its VRAM like any other model.
- MultiModelManager.start_whisper_server(): estimate the gguf footprint, evict
  other models if free VRAM is short, start the subprocess, and register it in
  models/models_in_vram/_measured_vram_gb (active_in_vram). It's now both a
  trigger for eviction and an eviction candidate.
- stop_whisper_server(): stop + clear all that accounting (frees VRAM).
- Routed every start/stop through these: on-request transcription, engine
  startup pre-load, admin model-load (Load button) and model-unload/disable.

So: starting a runner = a model load (evicts as needed); unloading = frees VRAM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
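
The accounting described in this commit message can be illustrated with a small, self-contained sketch. It is not the project's code: `ToyManager` and `FakeRunner` are hypothetical stand-ins, free VRAM is computed from recorded footprints instead of being queried from the GPU, the footprint estimate (gguf size times an `overhead` factor) and the oldest-first eviction order are assumptions, and no subprocess is started. Only the names `cleanup`, `unload_model`, `_evict_one`, `start_whisper_server`, `stop_whisper_server`, `models`, `models_in_vram` and `_measured_vram_gb` are taken from the message above.

```python
from dataclasses import dataclass, field


class FakeRunner:
    """Hypothetical stand-in for WhisperServerManager; starts no subprocess."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def cleanup(self):
        # As in the commit: cleanup() delegates to stop(), so generic
        # eviction code can free the runner like any other model.
        self.stop()


@dataclass
class ToyManager:
    total_vram_gb: float
    models: dict = field(default_factory=dict)             # name -> object with cleanup()
    models_in_vram: list = field(default_factory=list)     # load order, oldest first
    _measured_vram_gb: dict = field(default_factory=dict)  # name -> footprint in GB

    def free_vram_gb(self):
        # Assumption: free VRAM is derived from recorded footprints only.
        return self.total_vram_gb - sum(self._measured_vram_gb.values())

    def unload_model(self, name):
        # Free the model and clear all three pieces of accounting.
        self.models.pop(name).cleanup()
        self.models_in_vram.remove(name)
        self._measured_vram_gb.pop(name, None)

    def _evict_one(self):
        # Assumption: evict the oldest loaded model first.
        if not self.models_in_vram:
            return False
        self.unload_model(self.models_in_vram[0])
        return True

    def load(self, name, model, footprint_gb):
        # Evict until the footprint fits or nothing is left to evict;
        # this sketch then loads regardless of whether enough was freed.
        while self.free_vram_gb() < footprint_gb and self._evict_one():
            pass
        self.models[name] = model
        self.models_in_vram.append(name)
        self._measured_vram_gb[name] = footprint_gb

    def start_whisper_server(self, name, gguf_size_gb, overhead=1.2):
        # Starting a runner is treated as a model load: estimate the
        # footprint, evict as needed, start, then register.
        runner = FakeRunner()
        self.load(name, runner, gguf_size_gb * overhead)
        runner.start()
        return runner

    def stop_whisper_server(self, name):
        # Stopping clears the accounting, which frees the recorded VRAM.
        if name in self.models:
            self.unload_model(name)
```

With a 10 GB budget, loading an 8 GB model and then calling `start_whisper_server("whisper", 2.0)` evicts the first model to make room; loading a 9 GB model afterwards evicts the runner in turn, because it is registered like any other model.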

Name
..
admin
api
backends
broker
frontproxy
models
openai
pydantic
queue
tasks
__init__.py
cli.py
config.py
main.py
platform_paths.py