Stefy Lanza (nextime / spora) authored
Add a server-wide host-RAM ceiling (OffloadConfig.max_ram_gb) alongside the
existing VRAM budgeting:

- hf_loading clamps the accelerate CPU-offload budget to the headroom under
  the cap, so overflow spills to the disk offload folder instead of growing
  RSS.
- manager: process-tree RSS accounting, true-LRU (active_in_vram property
  stamps _last_used), shared _evict_one, and _evict_models_for_ram; idle
  models are evicted before a new load when RSS nears the cap.
- ram_monitor.py: background watcher samples RSS, flags a suspected leak when
  it climbs while the scheduler is idle, and runs a mitigation ladder
  (gc -> empty_cache -> malloc_trim -> drop upscaler cache -> evict idle).
- admin /status returns a ram block; Settings page exposes max RAM +
  evict/leak-watch toggles (applied live); dashboard shows a RAM gauge +
  leak badge.

Also fold loaded upscalers (_UPSCALER_CACHE) into the dashboard models-loaded
count so an active upscale no longer reports '0 models loaded'.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
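
For readers skimming the message, the headroom clamp and the mitigation ladder can be illustrated with a minimal sketch. This is not the code from this commit: `clamp_cpu_budget`, `run_mitigation_ladder`, their parameters, and the choice to stop the ladder as soon as RSS falls to a target are all hypothetical, and treating a missing or non-positive `max_ram_gb` as "no cap" is an assumption.

```python
from typing import Callable, List, Optional, Sequence, Tuple

GIB = 1024 ** 3

# Hypothetical helper: not the function used in hf_loading.
def clamp_cpu_budget(requested_bytes: int, rss_bytes: int,
                     max_ram_gb: Optional[float]) -> int:
    """Limit a CPU-offload budget to the headroom left under the RAM cap."""
    # Assumption: None or a non-positive cap means "no ceiling configured".
    if not max_ram_gb or max_ram_gb <= 0:
        return requested_bytes
    headroom = max(0, int(max_ram_gb * GIB) - rss_bytes)
    # Anything above the headroom would have to go to the disk offload folder.
    return min(requested_bytes, headroom)

Step = Tuple[str, Callable[[], None]]

# Hypothetical helper: not the ladder implemented in ram_monitor.py.
def run_mitigation_ladder(steps: Sequence[Step],
                          read_rss: Callable[[], int],
                          target_bytes: int) -> List[str]:
    """Run steps in order, cheapest first, until RSS is at or below target."""
    applied: List[str] = []
    for name, action in steps:
        if read_rss() <= target_bytes:
            break  # Assumed policy: stop escalating once RSS has recovered.
        action()
        applied.append(name)
    return applied
```

With a 10 GiB cap and 6 GiB already resident, a requested 8 GiB budget would be clamped to 4 GiB in this sketch. The ladder takes its steps as plain callables, so the real actions (gc, empty_cache, malloc_trim, and so on) could be swapped in without changing the loop.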
99f8ba85