1. 09 Mar, 2026 8 commits
  2. 08 Mar, 2026 27 commits
  3. 07 Mar, 2026 3 commits
  4. 05 Mar, 2026 2 commits
    • Add fallback for models that don't support load_in_4bit quantization · e7e2c626
      Stefy Lanza (nextime / spora) authored
      Modify _try_load_model() to catch TypeError when quantization arguments
      are not supported by the model class. When this happens, the method now:
      1. Warns the user about unsupported quantization
      2. Retries loading the model without quantization arguments
      3. Returns the model if the retry succeeds
      
      This fixes issues with models like Qwen3.5 that don't support
      bitsandbytes quantization.
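
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function name `load_with_quant_fallback` and its parameters are hypothetical, and `loader` stands in for whatever callable the project actually uses (for example a model class's `from_pretrained` method).

```python
import warnings


def load_with_quant_fallback(loader, model_name, quant_kwargs=None, **base_kwargs):
    """Try loading with quantization arguments; retry without them on TypeError.

    `loader` is any callable taking (model_name, **kwargs). All names here are
    illustrative assumptions, not the project's real API.
    """
    quant_kwargs = quant_kwargs or {}
    try:
        return loader(model_name, **base_kwargs, **quant_kwargs)
    except TypeError as exc:
        if not quant_kwargs:
            # Nothing to strip, so the TypeError has another cause.
            raise
        # Step 1: warn the user about the unsupported quantization.
        warnings.warn(
            f"{model_name}: quantization arguments {sorted(quant_kwargs)} "
            f"were rejected ({exc}); retrying without quantization."
        )
        # Steps 2 and 3: retry without the arguments and return the result.
        return loader(model_name, **base_kwargs)
```

A call might look like `load_with_quant_fallback(loader, "some-model", {"load_in_4bit": True}, device_map="auto")`. One trade-off of this approach is that any `TypeError` raised during loading triggers the retry, not only one caused by the quantization arguments, and the model is then loaded unquantized and needs more memory.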
    • Add OOM handling during generation to prevent crashes · 33a7e421
      Stefy Lanza (nextime / spora) authored
      - Wrap generate() with try-except to catch CUDA OOM errors
      - On OOM: clear the CUDA cache, retry with half the token limit, and return a graceful error if the retry also fails
      - Wrap generate_stream() thread with error handling using shared variable
      - Yield error messages to client instead of crashing the process
      - Allows server to continue running after generation OOM
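
The retry and streaming behaviour listed above might look something like the sketch below. It is an assumption-laden illustration, not the commit's code: `generate_with_oom_retry`, `stream_with_error_reporting`, and their parameters are hypothetical names. With PyTorch one would presumably pass `oom_exc=torch.cuda.OutOfMemoryError` and `clear_cache=torch.cuda.empty_cache`; the defaults here use `MemoryError` and a no-op so the example runs without a GPU.

```python
import queue
import threading


def generate_with_oom_retry(generate_fn, max_new_tokens,
                            oom_exc=MemoryError, clear_cache=lambda: None):
    """Call generate_fn(max_new_tokens); on OOM, clear the cache and retry once
    with half the token limit. Returns an error string if the retry also fails."""
    try:
        return generate_fn(max_new_tokens)
    except oom_exc:
        clear_cache()
        try:
            return generate_fn(max(1, max_new_tokens // 2))
        except oom_exc:
            clear_cache()
            return "Error: out of memory during generation; try fewer tokens."


def stream_with_error_reporting(produce_chunks):
    """Run produce_chunks() in a worker thread and yield its chunks.

    A failure in the worker is stored in a shared dict and yielded to the
    client as an error message instead of crashing the process."""
    chunks = queue.Queue()
    state = {"error": None}   # shared between the worker and the consumer
    done = object()           # sentinel marking the end of the stream

    def worker():
        try:
            for chunk in produce_chunks():
                chunks.put(chunk)
        except Exception as exc:  # includes OOM errors raised in the thread
            state["error"] = exc
        finally:
            chunks.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = chunks.get()
        if item is done:
            break
        yield item
    if state["error"] is not None:
        yield f"Error: generation failed ({state['error']})"
```

Returning an error string rather than re-raising keeps the server loop alive, at the cost of the caller having to distinguish error text from generated text; a real implementation may prefer a structured error response.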