    ds4: configurable CUDA env knobs (expert-cache reserve + free-form extra_env) · 7fc393d4
    Stefy Lanza (nextime / spora ) authored
    ds4-server exposes several CUDA tunables only through environment variables,
    not CLI flags. By default ds4 reserves half the card for non-cache use and
    allocates the model weight arena in 1792 MiB chunks. On small-weight MoE
    models served from SSD, both defaults starve the streaming expert cache or
    make it OOM.
    
    Pass an explicit env to ds4-server (Popen now sets env=) with:
      - expert_cache_reserve_gb: typed knob -> DS4_CUDA_STREAMING_EXPERT_CACHE_RESERVE_GB
        (0 = leave ds4's default).
      - extra_env: free-form KEY=VALUE passthrough for the rest, e.g.
        DS4_CUDA_WEIGHT_ARENA_CHUNK_MB=512 to shrink the weight-arena chunk so it
        fits a heap fragmented by the expert cache.
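
A minimal sketch of how the two knobs above could be merged into the environment handed to Popen. The helper name `build_ds4_env`, its signature, and the one-KEY=VALUE-per-line format for extra_env are assumptions made for illustration, not the code in ds4_worker.py; only the two DS4_CUDA_* variable names come from this commit message.

```python
import os


def build_ds4_env(expert_cache_reserve_gb=0, extra_env="", base_env=None):
    """Hypothetical helper: build the env dict for subprocess.Popen(env=...)."""
    # Start from a copy of the parent environment so PATH etc. are preserved.
    env = dict(os.environ if base_env is None else base_env)

    # Typed knob: 0 means "leave ds4's default", so the variable is not set.
    if expert_cache_reserve_gb and expert_cache_reserve_gb > 0:
        env["DS4_CUDA_STREAMING_EXPERT_CACHE_RESERVE_GB"] = str(
            expert_cache_reserve_gb
        )

    # Free-form passthrough. Assumed format: one KEY=VALUE per line;
    # blank lines and entries without "=" or without a key are skipped.
    for line in extra_env.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key:
            env[key] = value.strip()

    return env
```

With both knobs at their defaults (0 and an empty string) the result is just a copy of the base environment, which matches the no-op default described in this commit. A caller would pass the result as `subprocess.Popen(cmd, env=build_ds4_env(...))`.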
    
    Both are surfaced in Settings (config + admin GET/POST + UI) and default to
    no-op, so behaviour is unchanged unless they are set.
    Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ds4_worker.py 20.5 KB