- 08 Mar, 2026 1 commit
-
-
Stefy Lanza (nextime / spora ) authored
-
- 07 Mar, 2026 3 commits
-
-
Stefy Lanza (nextime / spora ) authored
Detect chat template from model and use appropriate formatting - avoid Jinja errors by using manual formatting when template detection fails
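A minimal sketch of the fallback described in this commit, assuming a Hugging Face style tokenizer that exposes `chat_template` and `apply_chat_template()`. The helper names `format_chat` and `manual_format`, and the plain "role: content" layout, are hypothetical and not taken from the repository.

```python
def manual_format(messages):
    """Plain 'role: content' formatting used when no usable template exists."""
    lines = [f"{m['role']}: {m['content']}" for m in messages]
    lines.append("assistant:")  # cue the model to answer
    return "\n".join(lines)


def format_chat(tokenizer, messages):
    """Prefer the model's own chat template; fall back to manual formatting."""
    if getattr(tokenizer, "chat_template", None):
        try:
            return tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
        except Exception:
            # Template rendering failed (for example a Jinja error); fall through.
            pass
    return manual_format(messages)
```

Catching a broad `Exception` is a deliberate simplification here, so that any template rendering failure ends in the manual path rather than an error response.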
-
- 05 Mar, 2026 3 commits
-
-
Stefy Lanza (nextime / spora ) authored
Modify _try_load_model() to catch TypeError when quantization arguments are not supported by the model class. When this happens, the method now:
1. Warns the user about unsupported quantization
2. Retries loading the model without quantization arguments
3. Returns the model successfully if loading works

This fixes issues with models like Qwen3.5 that don't support bitsandbytes quantization.
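The three steps above can be sketched as follows. This is an illustration under assumptions, not the repository's `_try_load_model()`: the `loader` callable stands in for something like a model class's `from_pretrained`, and the `quant_kwargs` parameter name is hypothetical.

```python
import warnings


def try_load_model(loader, model_name, quant_kwargs=None, **kwargs):
    """Load a model, retrying without quantization if the loader rejects it."""
    quant_kwargs = quant_kwargs or {}
    try:
        return loader(model_name, **kwargs, **quant_kwargs)
    except TypeError as exc:
        if not quant_kwargs:
            raise  # nothing to strip, so this is a genuine error
        # Assumption: a TypeError here means the quantization arguments
        # were not accepted by the model class.
        warnings.warn(
            f"Quantization not supported for {model_name} ({exc}); "
            "loading without quantization."
        )
        return loader(model_name, **kwargs)
```

Passing the loader in as a parameter keeps the retry logic testable without downloading a model.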
-
Stefy Lanza (nextime / spora ) authored
  - Wrap generate() with try-except to catch CUDA OOM errors
  - On OOM: clear CUDA cache, retry with half tokens, return graceful error if still failing
  - Wrap generate_stream() thread with error handling using shared variable
  - Yield error messages to client instead of crashing the process
  - Allows server to continue running after generation OOM
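A rough sketch of the non-streaming retry path, written without a hard PyTorch dependency so it runs anywhere. The function name, the `generate_fn(prompt, max_new_tokens)` signature and the error text are assumptions; with PyTorch one would typically pass `torch.cuda.OutOfMemoryError` as `oom_error` and `torch.cuda.empty_cache` as `clear_cache`.

```python
def generate_with_oom_retry(generate_fn, prompt, max_new_tokens,
                            oom_error=MemoryError, clear_cache=lambda: None):
    """Run generation once; on OOM, free memory and retry with half the tokens."""
    try:
        return generate_fn(prompt, max_new_tokens)
    except oom_error:
        clear_cache()  # release cached allocations before retrying

    retry_tokens = max(1, max_new_tokens // 2)
    try:
        return generate_fn(prompt, retry_tokens)
    except oom_error:
        clear_cache()
        # Return a message instead of raising so the server keeps running.
        return "[error] out of memory during generation; try fewer tokens"
```

The streaming case would need the same handling inside the worker thread, with the error passed back to the generator through a shared variable as the commit describes.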
-
Stefy Lanza (nextime / spora ) authored
The new --max-gpu-percent parameter lets users specify the exact percentage of GPU VRAM to use, overriding the offload-strategy. When specified, the model will:
1. Use up to max-gpu-percent of VRAM
2. Offload remaining weights to CPU RAM (--ram)
3. Overflow to disk (--offload-dir) if RAM is exhausted
4. Automatically fall back in 5% steps if OOM occurs

Example usage for an RTX 3090 with Qwen3.5-35B-A3B:

    coderai --model Qwen/Qwen3.5-35B-A3B --max-gpu-percent 50 --ram 64

This ensures MoE models with high VRAM requirements during generation can run without OOM by using CPU RAM as the primary offload target.
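One way the budget and the 5% fallback could fit together is sketched below. It assumes the percentage is turned into a per-device memory budget of the kind accepted by the `max_memory` argument in Transformers/Accelerate; the function names and the `load_fn(budget)` callable are hypothetical, and the disk overflow step is left out.

```python
def build_max_memory(total_vram_gib, gpu_percent, ram_gib):
    """Translate a GPU percentage and RAM size into a per-device budget."""
    gpu_gib = int(total_vram_gib * gpu_percent / 100)  # round down to whole GiB
    return {0: f"{gpu_gib}GiB", "cpu": f"{ram_gib}GiB"}


def load_with_fallback(load_fn, total_vram_gib, max_gpu_percent, ram_gib,
                       step=5, oom_error=MemoryError):
    """Try loading at max_gpu_percent, lowering the GPU share by `step` on OOM."""
    percent = max_gpu_percent
    while percent > 0:
        budget = build_max_memory(total_vram_gib, percent, ram_gib)
        try:
            return load_fn(budget), percent
        except oom_error:
            percent -= step  # give the GPU a smaller share and try again
    raise oom_error("could not load the model at any GPU budget")
```

For the example above (a 24 GiB card at 50% with 64 GiB of RAM), `build_max_memory(24, 50, 64)` yields a 12 GiB GPU budget and a 64 GiB CPU budget.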
-
- 01 Mar, 2026 33 commits
-
Stefy Lanza (nextime / spora ) authored
Add session management, readline history, context compression, --ctx, --micro flags, and context counter in prompt
-
Stefy Lanza (nextime / spora ) authored
Fix streaming display in coder CLI - use iter_lines for immediate output, remove threading timer, simplify tool parsing
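The `iter_lines` approach named in this commit can be illustrated as below. The wire format is an assumption: the sketch treats the server output as server-sent-event style lines prefixed with `data: ` and ending with a `[DONE]` sentinel, which the source does not confirm. Both function names are hypothetical.

```python
def iter_stream_text(lines):
    """Yield payload text from an iterable of raw lines as they arrive."""
    for raw in lines:
        if not raw:
            continue  # iter_lines() yields empty keep-alive lines
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if line.startswith("data: "):
            line = line[len("data: "):]
        if line.strip() == "[DONE]":
            break  # assumed end-of-stream sentinel
        yield line


def stream_completion(url, payload):
    """Print each chunk as soon as the server sends it."""
    import requests  # third-party; imported lazily so the parser has no dependencies

    with requests.post(url, json=payload, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for text in iter_stream_text(resp.iter_lines()):
            print(text, flush=True)
```

Reading with `stream=True` and `iter_lines()` lets output appear as each line arrives, which is why no timer thread is needed to poll for new text.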