1. 08 Mar, 2026 1 commit
  2. 07 Mar, 2026 3 commits
  3. 05 Mar, 2026 3 commits
    • Add fallback for models that don't support load_in_4bit quantization · e7e2c626
      Stefy Lanza (nextime / spora ) authored
      Modify _try_load_model() to catch TypeError when quantization arguments
      are not supported by the model class. When this happens, the method now:
      1. Warns the user about unsupported quantization
      2. Retries loading the model without quantization arguments
      3. Returns the model successfully if loading works
      
      This fixes issues with models like Qwen3.5 that don't support
      bitsandbytes quantization.
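
The fallback described above can be sketched as follows. This is a minimal illustration, not the project's code: the real signature of _try_load_model() is not shown in the commit, so `try_load_model`, `loader`, and `QUANT_KEYS` are hypothetical names, and the set of quantization arguments is an assumption.

```python
import warnings

# Assumed set of quantization-related keyword arguments (hypothetical).
QUANT_KEYS = ("load_in_4bit", "load_in_8bit", "quantization_config")


def try_load_model(loader, model_id, **kwargs):
    """Call loader(model_id, **kwargs); on TypeError, retry without quantization kwargs."""
    try:
        return loader(model_id, **kwargs)
    except TypeError as exc:
        stripped = {k: v for k, v in kwargs.items() if k not in QUANT_KEYS}
        if len(stripped) == len(kwargs):
            # No quantization arguments were passed, so the error is unrelated.
            raise
        # Step 1: warn the user about unsupported quantization.
        warnings.warn(
            f"{model_id}: quantization arguments not supported ({exc}); "
            "retrying without quantization"
        )
        # Steps 2 and 3: retry without quantization and return the model.
        return loader(model_id, **stripped)
```

Re-raising when no quantization argument was present keeps unrelated TypeErrors visible instead of hiding them behind a silent retry.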
    • Add OOM handling during generation to prevent crashes · 33a7e421
      Stefy Lanza (nextime / spora ) authored
      - Wrap generate() in a try-except to catch CUDA OOM errors
      - On OOM: clear the CUDA cache, retry with half the tokens, and return a graceful error if it still fails
      - Wrap the generate_stream() thread with error handling that reports failures through a shared variable
      - Yield error messages to the client instead of crashing the process
      - Allow the server to keep running after a generation OOM
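
A minimal sketch of the two patterns listed above, under stated assumptions. The function names, the error strings, and the callback-based design are hypothetical and chosen so the example runs without a GPU. With PyTorch one would presumably pass `torch.cuda.OutOfMemoryError` as `oom_error` and `torch.cuda.empty_cache` as `clear_cache`.

```python
import queue
import threading


def generate_with_oom_retry(generate_fn, max_new_tokens, oom_error, clear_cache=lambda: None):
    """Try once; on OOM clear the cache and retry once with half the tokens."""
    try:
        return generate_fn(max_new_tokens)
    except oom_error:
        clear_cache()
    try:
        return generate_fn(max(1, max_new_tokens // 2))
    except oom_error:
        clear_cache()
        # Graceful error instead of letting the exception kill the server.
        return "[error] out of GPU memory during generation"


def stream_with_error_handling(produce, oom_error):
    """Run produce(put) in a worker thread; yield its chunks, then any error message."""
    chunks = queue.Queue()
    state = {"error": None}  # shared variable written by the worker thread
    done = object()          # sentinel marking the end of the stream

    def worker():
        try:
            produce(chunks.put)
        except oom_error as exc:
            state["error"] = f"[error] generation failed: {exc}"
        finally:
            chunks.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while (item := chunks.get()) is not done:
        yield item
    if state["error"]:
        yield state["error"]
```

The worker always enqueues the sentinel in `finally`, so the consumer loop terminates even when generation fails midway, and the error reaches the client as a final chunk.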
    • Add --max-gpu-percent parameter for fine-grained GPU memory control · d62bdffb
      Stefy Lanza (nextime / spora ) authored
      This new parameter allows users to specify the exact percentage of GPU VRAM
      to use, overriding the offload-strategy. When specified, the model will:
      1. Use up to max-gpu-percent of VRAM
      2. Offload remaining weights to CPU RAM (--ram)
      3. Overflow to disk (--offload-dir) if RAM is exhausted
      4. Automatically fall back in 5% steps if an OOM occurs
      
      Example usage for RTX 3090 with Qwen3.5-35B-A3B:
        coderai --model Qwen/Qwen3.5-35B-A3B --max-gpu-percent 50 --ram 64
      
      This ensures MoE models with high VRAM requirements during generation
      can run without OOM by using CPU RAM as the primary offload target.
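
The memory budgeting and 5% fallback can be illustrated roughly as below. This is a sketch under assumptions, not the project's implementation: `build_max_memory`, `load_with_fallback`, and `load_fn` are hypothetical names, disk overflow is not modelled, and the commit does not say at which stage the fallback is triggered. The returned dictionary follows the `max_memory` mapping format accepted by Accelerate and Transformers (device index and "cpu" keys, values such as "12GiB").

```python
def build_max_memory(total_vram_gib, gpu_percent, ram_gib, device=0):
    """Translate a GPU percentage and a RAM budget into a max_memory-style mapping."""
    gpu_gib = int(total_vram_gib * gpu_percent / 100)
    return {device: f"{gpu_gib}GiB", "cpu": f"{ram_gib}GiB"}


def load_with_fallback(load_fn, total_vram_gib, gpu_percent, ram_gib, oom_error, step=5):
    """Call load_fn(max_memory); on OOM lower the GPU share by `step` percent and retry."""
    percent = gpu_percent
    while percent > 0:
        try:
            model = load_fn(build_max_memory(total_vram_gib, percent, ram_gib))
            return model, percent
        except oom_error:
            percent -= step  # fall back in fixed steps, as described above
    raise RuntimeError("could not load the model at any GPU percentage")
```

For the RTX 3090 example above (24 GiB of VRAM), `build_max_memory(24, 50, 64)` yields a 12 GiB GPU budget and a 64 GiB CPU budget.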
  4. 01 Mar, 2026 33 commits