codai/models · f0dcf7eb2a2687135132f13eb2662aea05585ebf · nexlab / coderai

whisper: count a running runner as a loaded model for VRAM eviction · 2a214215

Stefy Lanza (nextime / spora) authored Jun 19, 2026

Starting a whisper-server runner loads the gguf onto the GPU, but the runner
was invisible to the VRAM-eviction logic: it never evicted other models to make
room, recorded no footprint, and (lacking a cleanup()) could not itself be
evicted.

- Add WhisperServerManager.cleanup(), which delegates to stop(), so
  _evict_one/unload_model can free its VRAM like any other model.
- MultiModelManager.start_whisper_server(): estimate the gguf footprint, evict
  other models if free VRAM is short, start the subprocess, and register it in
  models/models_in_vram/_measured_vram_gb (active_in_vram). It's now both a
  trigger for eviction and an eviction candidate.
- stop_whisper_server(): stop the subprocess and clear all of that accounting,
  which frees the VRAM.
- Routed every start/stop through these: on-request transcription, engine
  startup pre-load, admin model-load (Load button) and model-unload/disable.
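
The accounting described in the list above can be sketched roughly as follows.
This is a minimal illustration, not the code from manager.py: the class and
method names come from the commit message, but every signature, the
oldest-first eviction order, the overhead factor, the FakeModel and register()
helpers, and the bookkeeping-based free-VRAM figure are assumptions made for
the example. No subprocess is launched and no GPU is queried.

```python
class WhisperServerManager:
    """Stand-in for the runner manager; the real one manages a subprocess."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def cleanup(self):
        # Same entry point other models expose, so eviction can call it.
        self.stop()


class FakeModel:
    """Hypothetical placeholder for any other loaded model."""

    def __init__(self):
        self.cleaned = False

    def cleanup(self):
        self.cleaned = True


class MultiModelManager:
    def __init__(self, total_vram_gb):
        self.total_vram_gb = total_vram_gb
        self.models = {}             # name -> object exposing cleanup()
        self.models_in_vram = []     # load order, oldest first (assumed)
        self._measured_vram_gb = {}  # name -> recorded footprint in GB

    def free_vram_gb(self):
        # Assumption: derived from bookkeeping instead of asking the GPU.
        return self.total_vram_gb - sum(self._measured_vram_gb.values())

    def register(self, name, model, footprint_gb):
        # Hypothetical helper: record a model in all three structures.
        self.models[name] = model
        self.models_in_vram.append(name)
        self._measured_vram_gb[name] = footprint_gb

    def unload_model(self, name):
        self.models.pop(name).cleanup()
        self.models_in_vram.remove(name)
        self._measured_vram_gb.pop(name, None)

    def _evict_one(self):
        # Assumed policy: evict the oldest loaded model first.
        if not self.models_in_vram:
            return False
        self.unload_model(self.models_in_vram[0])
        return True

    def start_whisper_server(self, name, gguf_size_gb, overhead=1.2):
        needed = gguf_size_gb * overhead  # assumed footprint estimate
        # Evict until the estimate fits or nothing is left to evict;
        # this sketch starts the runner either way.
        while self.free_vram_gb() < needed and self._evict_one():
            pass
        runner = WhisperServerManager()
        runner.start()
        self.register(name, runner, needed)
        return runner

    def stop_whisper_server(self, name):
        if name in self.models:
            self.unload_model(name)  # calls cleanup() -> stop()
```

With an 8 GB budget and a 6 GB model already registered, starting a runner
for a 2 GB gguf in this sketch evicts the older model first, and stopping the
runner returns the bookkeeping to its empty state.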

So: starting a runner = a model load (evicts as needed); unloading = frees VRAM.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Name
cache
__init__.py
acceleration.py
capabilities.py
grammar.py
hf_loading.py
manager.py
parser.py
pipeline_cache.py
quant.py
ram_monitor.py
templates.py
thermal.py
tmp_janitor.py
tool_call_grammar.gbnf
turboquant.py
utils.py