Commits · 6413d14fbe8fb29e63529c9b57ffa6b281e27640 · nexlab / coderai

08 Mar, 2026 1 commit
- Debug Vulkan single GPU mode and add GGML_VULKAN_DEVICE env var · 6413d14f
  Stefy Lanza (nextime / spora ) authored Mar 08, 2026
  
  6413d14f
07 Mar, 2026 3 commits
- Detect chat template from model and use appropriate formatting - avoid Jinja errors by using manual formatting when template detection fails · 8d484ec2
  Stefy Lanza (nextime / spora ) authored Mar 07, 2026
  
  8d484ec2
- Fix Jinja2 template error - properly handle multipart content arrays and tool_calls format · 576a6cfe
  Stefy Lanza (nextime / spora ) authored Mar 07, 2026
  
  576a6cfe
- Fix Jinja2 template error in Vulkan backend - ensure all messages have content attribute · 08eee40c
  Stefy Lanza (nextime / spora ) authored Mar 07, 2026
  
  08eee40c
05 Mar, 2026 3 commits

Add fallback for models that don't support load_in_4bit quantization · e7e2c626

Stefy Lanza (nextime / spora ) authored Mar 05, 2026

Modify _try_load_model() to catch TypeError when quantization arguments
are not supported by the model class. When this happens, the method now:
1. Warns the user about unsupported quantization
2. Retries loading the model without quantization arguments
3. Returns the model successfully if loading works

This fixes issues with models like Qwen3.5 that don't support
bitsandbytes quantization.

e7e2c626
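
The fallback described in e7e2c626 can be sketched roughly as follows. This is a minimal illustration, not the project's actual `_try_load_model()`: `load_with_quant_fallback` and `fake_loader` are hypothetical names, and the loader is passed in as a callable so the sketch runs without any model library installed.

```python
import warnings
from typing import Any, Callable, Mapping


def load_with_quant_fallback(
    loader: Callable[..., Any],
    model_name: str,
    quant_kwargs: Mapping[str, Any],
    **base_kwargs: Any,
) -> Any:
    """Try a quantized load; retry without quantization if the kwargs are rejected."""
    try:
        return loader(model_name, **base_kwargs, **quant_kwargs)
    except TypeError as exc:
        if not quant_kwargs:
            raise  # nothing to strip, so the TypeError has another cause
        # Step 1: warn the user about the unsupported quantization.
        warnings.warn(
            f"{model_name}: quantization arguments not supported ({exc}); "
            "retrying without quantization",
            RuntimeWarning,
        )
        # Steps 2 and 3: retry without the quantization arguments and return the result.
        return loader(model_name, **base_kwargs)


def fake_loader(name: str, **kwargs: Any) -> dict:
    """Hypothetical loader that rejects load_in_4bit, standing in for a model class."""
    if "load_in_4bit" in kwargs:
        raise TypeError("got an unexpected keyword argument 'load_in_4bit'")
    return {"name": name, "kwargs": kwargs}


if __name__ == "__main__":
    model = load_with_quant_fallback(
        fake_loader, "demo-model", {"load_in_4bit": True}, device_map="auto"
    )
    print(model)
```

One caveat of this approach: a `TypeError` raised for an unrelated reason also triggers the retry, although it will normally surface again on the second attempt.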

Add OOM handling during generation to prevent crashes · 33a7e421

Stefy Lanza (nextime / spora ) authored Mar 05, 2026

- Wrap generate() with try-except to catch CUDA OOM errors
- On OOM: clear the CUDA cache, retry with half as many tokens, and return a graceful error if generation still fails
- Wrap generate_stream() thread with error handling using shared variable
- Yield error messages to client instead of crashing the process
- Allows server to continue running after generation OOM

33a7e421
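
The behaviour listed in 33a7e421 might look something like the sketch below. All names here are hypothetical and the code is library-free so that it runs anywhere: `MemoryError` stands in for the CUDA out-of-memory exception, and `clear_cache` stands in for whatever frees cached GPU memory (with PyTorch, one would presumably pass `torch.cuda.OutOfMemoryError` and `torch.cuda.empty_cache`).

```python
import queue
import threading
from typing import Callable, Iterator, Optional, Type


def generate_with_oom_retry(
    generate_fn: Callable[[str, int], str],
    prompt: str,
    max_new_tokens: int,
    oom_exc: Type[BaseException] = MemoryError,
    clear_cache: Callable[[], None] = lambda: None,
) -> dict:
    """Run generate_fn; on OOM clear the cache and retry once with half the tokens."""
    try:
        return {"text": generate_fn(prompt, max_new_tokens), "error": None}
    except oom_exc:
        clear_cache()
    reduced = max(1, max_new_tokens // 2)
    try:
        return {"text": generate_fn(prompt, reduced), "error": None}
    except oom_exc:
        clear_cache()
        # Graceful error instead of crashing the server process.
        return {"text": "", "error": f"out of memory even with max_new_tokens={reduced}"}


def stream_with_error_capture(
    produce: Callable[[Callable[[str], None]], None],
) -> Iterator[str]:
    """Run produce(emit) in a thread; yield its chunks, then any error as a message."""
    chunks: "queue.Queue[object]" = queue.Queue()
    state: dict = {"error": None}  # shared variable written by the worker thread
    done = object()

    def worker() -> None:
        try:
            produce(chunks.put)
        except Exception as exc:  # includes MemoryError
            state["error"] = exc
        finally:
            chunks.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = chunks.get()
        if item is done:
            break
        yield item  # type: ignore[misc]
    error: Optional[BaseException] = state["error"]
    if error is not None:
        yield f"[error] generation failed: {error}"
```

Because the error travels through the same generator as the text, the client sees a final error chunk while the server keeps running.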

Add --max-gpu-percent parameter for fine-grained GPU memory control · d62bdffb

Stefy Lanza (nextime / spora ) authored Mar 05, 2026

This new parameter lets users specify the exact percentage of GPU VRAM
to use, overriding --offload-strategy. When specified, the model will:
1. Use up to max-gpu-percent of VRAM
2. Offload remaining weights to CPU RAM (--ram)
3. Overflow to disk (--offload-dir) if RAM is exhausted
4. Automatically fall back in 5% steps if OOM occurs

Example usage for RTX 3090 with Qwen3.5-35B-A3B:
  coderai --model Qwen/Qwen3.5-35B-A3B --max-gpu-percent 50 --ram 64

This ensures MoE models with high VRAM requirements during generation
can run without OOM by using CPU RAM as the primary offload target.

d62bdffb
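
As an illustration of the budget arithmetic and the 5% fallback in d62bdffb, here is a minimal sketch. It assumes the limit is applied through a per-device memory map in the style of Hugging Face Accelerate's `max_memory` argument; the commit does not confirm that, and `max_memory_for`, `load_with_percent_fallback`, and `load_fn` are hypothetical names. `MemoryError` again stands in for the CUDA out-of-memory exception.

```python
from typing import Any, Callable, Dict, Tuple, Type, Union

MemoryMap = Dict[Union[int, str], str]


def max_memory_for(total_vram_gib: float, gpu_percent: int, ram_gib: int) -> MemoryMap:
    """Build a memory map: a share of GPU 0 plus a CPU RAM budget."""
    gpu_mib = int(total_vram_gib * 1024 * gpu_percent / 100)
    return {0: f"{gpu_mib}MiB", "cpu": f"{ram_gib}GiB"}


def load_with_percent_fallback(
    load_fn: Callable[[MemoryMap], Any],
    total_vram_gib: float,
    gpu_percent: int,
    ram_gib: int,
    step: int = 5,
    oom_exc: Type[BaseException] = MemoryError,
) -> Tuple[Any, int]:
    """Call load_fn with shrinking GPU budgets until it stops raising OOM."""
    percent = gpu_percent
    while percent > 0:
        try:
            return load_fn(max_memory_for(total_vram_gib, percent, ram_gib)), percent
        except oom_exc:
            percent -= step  # fall back in fixed steps, 5% by default
    raise RuntimeError("model does not fit even at the lowest GPU percentage")


if __name__ == "__main__":
    # 24 GiB card (RTX 3090 class), 50% GPU budget, 64 GiB of RAM.
    print(max_memory_for(24, 50, 64))
```

For the 24 GiB example, a 50% budget works out to 12288 MiB on the GPU, with the remainder of the weights left to the CPU entry of the map.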

01 Mar, 2026 33 commits
- Add sequential offload strategy with fine-grained 2% VRAM incremental steps · e23c3f7f
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  e23c3f7f
- Add --offload-strategy parameter for NVIDIA backend with 4 strategy options · d9a5d274
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  d9a5d274
- Disable bitsandbytes quantization for Qwen3.5-A3B/MoE models which don't support it · 10d10573
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  10d10573
- Add 'a3b' to MoE model indicators to recognize Qwen3.5-A3B architecture · 8665016a
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  8665016a
- Add MoE model detection with 80% VRAM limit for generation headroom · 4041187d
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  4041187d
- Add GPU size-aware VRAM limits: 99% for <3GB, 96% for 3-8GB, 93% for >8GB · 13bb1675
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  13bb1675
- Add automatic OOM handling with progressive VRAM reduction fallback for NVIDIA backend · b30c4c04
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  b30c4c04
- Change NVIDIA backend VRAM limit from 99.9% to 93% to leave more headroom for CUDA overhead · 320ca0e7
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  320ca0e7
- Fix imports in coder CLI and add tokenizer dependencies + GGUF error detection · 2ca7368f
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  2ca7368f
- Fix single message mode, add --no-prompt flag, disable confirmations in non-interactive mode · 905dc92d
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  905dc92d
- Remove backup file · 14de0c00
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  14de0c00
- Improve multiline input handling - support paste, bracket detection, empty line finish · 0950cd2f
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  0950cd2f
- Save sessions immediately on creation and after context compression · 3e4b1ca6
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  3e4b1ca6
- Add session management, readline history, context compression, --ctx, --micro flags, and context counter in prompt · 1de9996c
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  1de9996c
- Hide raw tool output unless --debug flag is specified · 55fbd847
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  55fbd847
- Add --debug flag to coder CLI, hide raw tool calls unless debug mode is enabled · 59a9eb85
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  59a9eb85
- Fix streaming display in coder CLI - use iter_lines for immediate output, remove threading timer, simplify tool parsing · f9efda2b
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  f9efda2b
- Add --endpoint, --token, and --model CLI arguments for temporary overrides · 81da8cc5
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  81da8cc5
- Fix tool_call regex to handle multiline JSON · ade8d849
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  ade8d849
- Add XML tool format parser and filter tools from thinking display · 30f3e8a0
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  30f3e8a0
- Fix thinking display to update on every chunk and fix timer thread · e4cb426b
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  e4cb426b
- Fix thinking display with timer thread and parse tool_call tags from content · bcd150e2
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  bcd150e2
- Fix thinking display to use single line with proper timer updates · 0c31d3fd
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  0c31d3fd
- Add .gitignore and remove cached files · 0d76e514
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  0d76e514
- Add tool confirmation and fix thinking display in coder CLI · 09edf3bd
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  09edf3bd
- Add visual separator and multiline input support · 55810d7b
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  55810d7b
- Show thinking as single self-overwriting line with timer · fc4f93f7
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  fc4f93f7
- Add colorful CLI with CoderCLI> prompt and /command shortcuts · 7e0e358b
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  7e0e358b
- Add --small and --tiny args, show thinking content with timer · 8eee7e27
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  8eee7e27
- Add --timeout arg (default 600s) and graceful thinking display · 19452eb8
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  19452eb8
- Fix CLI streaming to use iter_content with smaller chunks for real-time output · dc604d6c
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  dc604d6c
- Collect all chunks in thread pool before yielding to avoid generator issues · 47738566
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  47738566
- Simplify async streaming using run_in_executor instead of manual thread · bf2b3b0a
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  bf2b3b0a