Stefy Lanza (nextime / spora) authored
ds4-server exposes several CUDA tunables only through environment variables, not CLI flags. By default, ds4 reserves half the card for non-cache use and allocates the model weight arena in 1792 MiB chunks; both can starve or OOM the streaming expert cache on small-weight MoE models served from SSD.

Pass an explicit environment to ds4-server (Popen now sets env=) with:

- expert_cache_reserve_gb: typed knob -> DS4_CUDA_STREAMING_EXPERT_CACHE_RESERVE_GB (0 = leave ds4's default).
- extra_env: free-form KEY=VALUE passthrough for the rest, e.g. DS4_CUDA_WEIGHT_ARENA_CHUNK_MB=512 to shrink the weight-arena chunk so it fits a heap fragmented by the expert cache.

Both are surfaced in Settings (config + admin GET/POST + UI) and default to no-op, so behaviour is unchanged unless set.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
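
For illustration only, the mapping from the two settings to the environment handed to Popen could look roughly like the sketch below. The helper name build_ds4_env, its signature, the one-pair-per-line format of extra_env, and the choice to let extra_env override the typed knob are all assumptions made for this example; they are not taken from the actual change.

```python
import os

# Variable name as given in the commit message.
RESERVE_VAR = "DS4_CUDA_STREAMING_EXPERT_CACHE_RESERVE_GB"


def build_ds4_env(expert_cache_reserve_gb=0, extra_env="", base_env=None):
    """Hypothetical helper: build the env dict to pass to ds4-server.

    expert_cache_reserve_gb: 0 (or None) leaves ds4's default untouched.
    extra_env: free-form text, assumed here to hold one KEY=VALUE per line.
    base_env: starting environment; defaults to a copy of os.environ.
    """
    env = dict(os.environ if base_env is None else base_env)

    # Typed knob: only exported when explicitly set to a non-zero value.
    if expert_cache_reserve_gb:
        env[RESERVE_VAR] = str(expert_cache_reserve_gb)

    # Free-form passthrough: skip blanks, comments, and malformed lines.
    for line in extra_env.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key:
            # Assumption: extra_env is applied last, so it wins on conflicts.
            env[key] = value.strip()

    return env


# Usage sketch (command line is a placeholder, not the real invocation):
#   import subprocess
#   env = build_ds4_env(2, "DS4_CUDA_WEIGHT_ARENA_CHUNK_MB=512")
#   subprocess.Popen(["ds4-server"], env=env)
```

With both settings left at their defaults (0 and an empty string), the helper returns the base environment unchanged, which matches the no-op default described above.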
7fc393d4