Add --max-gpu-percent parameter for fine-grained GPU memory control
This new parameter lets users specify the exact percentage of GPU VRAM to use, overriding the offload-strategy.

When specified, the model will:
1. Use up to max-gpu-percent of VRAM
2. Offload remaining weights to CPU RAM (--ram)
3. Overflow to disk (--offload-dir) if RAM is exhausted
4. Automatically fall back in 5% steps if OOM occurs

Example usage for RTX 3090 with Qwen3.5-35B-A3B:

    coderai --model Qwen/Qwen3.5-35B-A3B --max-gpu-percent 50 --ram 64

This lets MoE models with high VRAM requirements during generation run without OOM by using CPU RAM as the primary offload target.
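The placement order and the 5% fallback described above can be sketched roughly as follows. This is an illustration only, not the code in this commit: the function names, the `MemoryError` stand-in for a GPU OOM, the 5% lower bound, and the sizes in the usage note are all assumptions.

```python
from typing import Callable, Dict, Iterator, TypeVar

T = TypeVar("T")


def gpu_percent_candidates(max_gpu_percent: int, step: int = 5,
                           floor: int = 5) -> Iterator[int]:
    """Yield GPU percentages to try, from the requested cap downward.

    The lower bound (floor) is an assumption; the commit only states
    that the fallback proceeds in 5% steps.
    """
    pct = max_gpu_percent
    while pct >= floor:
        yield pct
        pct -= step


def plan_placement(model_gb: float, vram_gb: float, gpu_percent: int,
                   ram_gb: float) -> Dict[str, float]:
    """Split model weights across GPU, CPU RAM, and disk, in that order."""
    gpu = min(model_gb, vram_gb * gpu_percent / 100)  # up to the VRAM cap
    cpu = min(model_gb - gpu, ram_gb)                 # remainder to RAM
    disk = model_gb - gpu - cpu                       # overflow to disk
    return {"gpu": gpu, "cpu": cpu, "disk": disk}


def load_with_fallback(load: Callable[[int], T], max_gpu_percent: int,
                       step: int = 5) -> T:
    """Call load(pct) with decreasing pct until it stops running out of memory.

    MemoryError stands in for whatever OOM exception the real loader raises.
    """
    for pct in gpu_percent_candidates(max_gpu_percent, step):
        try:
            return load(pct)
        except MemoryError:
            continue  # retry with 5% less VRAM
    raise MemoryError("model did not fit at any GPU percentage")
```

With hypothetical sizes, `plan_placement(70, 24, 50, 64)` puts 12 GB on the GPU and 58 GB in RAM, leaving nothing for disk; raising the model size to 100 GB fills the 64 GB of RAM and spills 24 GB to disk.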