Commits · 6bd4dc9117f8d7f24f06116394e383ecacec45ba · nexlab / coderai

08 Mar, 2026 26 commits

Use bare except to suppress llama.cpp __del__ errors · 6bd4dc91
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

6bd4dc91
Suppress llama.cpp __del__ errors during pre-load · f9739fe3
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

f9739fe3
Remove traceback print for optional audio pre-load · ba8e4792
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

ba8e4792
Add clearer message when audio model loads on-demand · e554baef
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

e554baef
Try faster-whisper first for audio pre-load, fall back to GGUF · bae50d66
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

bae50d66
Use download_model helper for audio pre-load with progress · 4f6d64d4
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

4f6d64d4
Add download_model helper with progress: size, total, speed · b622fe9e
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

b622fe9e
Add better error handling for GGUF audio model loading · 23fe4347
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

23fe4347

Add GGUF audio model support with llama.cpp (Vulkan) · 3daca858

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

When audio model is in GGUF format, use llama.cpp instead of faster-whisper
for pre-loading. This allows using Vulkan backend for audio transcription.

3daca858

Auto-pre-load single model when only one model type is configured · 833a4ff3

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

When only one model type is specified (e.g., only --audio-model with no
--model), automatically pre-load it even in on-demand mode. This ensures
the model is downloaded and ready for use.

833a4ff3

Add model pre-loading support (--loadall, --loadswap) and fix duplicate code bug · 6310e8b1

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Add --loadall flag to pre-load all models at startup
- Add --loadswap flag to keep models in RAM, swap active to VRAM
- Fix bug where load_mode was used before being defined in audio model section
- Remove duplicate load_mode determination code
- Improve error message for no main model specified to include TTS
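
As a rough illustration of the load_mode fix, the mode can be computed once, up front, so that no later section reads it before it is defined. This is only a sketch: the flag names --loadall and --loadswap come from the commit message, while `build_parser`, `determine_load_mode`, the mode strings, and the precedence between the two flags are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the two pre-loading flags named in the commit."""
    parser = argparse.ArgumentParser(prog="coderai")
    parser.add_argument("--loadall", action="store_true",
                        help="pre-load all models at startup")
    parser.add_argument("--loadswap", action="store_true",
                        help="keep models in RAM, swap the active one to VRAM")
    return parser


def determine_load_mode(args: argparse.Namespace) -> str:
    """Resolve the load mode once; every model section reuses the result."""
    # Assumption: --loadswap wins if both flags are given.
    if args.loadswap:
        return "loadswap"
    if args.loadall:
        return "loadall"
    return "on-demand"


if __name__ == "__main__":
    load_mode = determine_load_mode(build_parser().parse_args())
    print(f"load mode: {load_mode}")
```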

6310e8b1

Add audio model pre-loading at startup when --loadall is used · 7651468e
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

7651468e

Add TTS support with kokoro-python and model caching improvements · ebd4acbb

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Add --tts-model option for Kokoro TTS models
- Add /v1/audio/speech endpoint (OpenAI-compatible)
- Add model caching to prevent redundant downloads
- Replace MD5 with SHA-256 for cache keys
- Move hashlib and pathlib imports to module level
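
A minimal sketch of how SHA-256 cache keys can prevent redundant downloads. The function names `cache_path_for` and `is_cached`, and the choice to hash the URL and keep the file suffix, are assumptions for illustration and are not taken from the repository.

```python
import hashlib
from pathlib import Path


def cache_path_for(url: str, cache_dir: Path) -> Path:
    """Map a model URL to a stable file path inside cache_dir."""
    # SHA-256 of the URL gives a fixed-length, filesystem-safe key.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Keep the original suffix (e.g. ".gguf"), ignoring any query string.
    suffix = Path(url.split("?", 1)[0]).suffix
    return cache_dir / f"{key}{suffix}"


def is_cached(url: str, cache_dir: Path) -> bool:
    """Return True when a previous download for this URL is already on disk."""
    return cache_path_for(url, cache_dir).is_file()
```

The same URL always maps to the same path, so a second startup can skip the download when `is_cached` returns True.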

ebd4acbb

Make --model optional when --audio-model or --image-model are specified · 10dc9f5c

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- --model is now optional if using audio or image models only
- Shows helpful error message with examples if no model specified
- Prints available models at startup

10dc9f5c

Support full URLs for model paths · 3ae1869a

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Accept full HTTPS URLs for --model (Vulkan/GGUF models)
- Accept full HTTPS URLs for --audio-model (faster-whisper models)
- Downloads file to temp directory before loading
- Shows download progress percentage
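
The progress line for such a download could be formatted roughly as follows. This is a sketch only: `format_progress`, its parameters, and the MiB-based layout are hypothetical and do not reflect the project's actual output.

```python
def format_progress(downloaded: int, total: int, elapsed_s: float) -> str:
    """Format bytes downloaded so far as a one-line progress message."""
    mib = 1024 * 1024
    # Guard against division by zero on the very first chunk.
    speed = downloaded / elapsed_s / mib if elapsed_s > 0 else 0.0
    if total > 0:
        percent = 100.0 * downloaded / total
        return (f"{downloaded / mib:.1f}/{total / mib:.1f} MiB "
                f"({percent:.0f}%) at {speed:.1f} MiB/s")
    # Servers that omit Content-Length give no total, so skip the percentage.
    return f"{downloaded / mib:.1f} MiB at {speed:.1f} MiB/s"
```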

3ae1869a

Add --debug flag to dump full requests and replies · c12c55d6

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Add --debug CLI argument to enable debug mode
- When enabled, dumps full request body (no truncation)
- When enabled, dumps full generated text (no truncation)
- When enabled, dumps extracted tool calls in JSON format
- Useful for troubleshooting tool call issues

c12c55d6

Fix Pydantic deprecation warnings and Jinja2 crash · 910238ba

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Replace class-based Config with model_config = ConfigDict() in all Pydantic models
- Fix Jinja2 crash by ensuring all messages have content key that is never None
- Enhanced message cleaning in generate_chat and generate_chat_stream to create copies and ensure content is always a string
- Add final safety check in chat_completions endpoint for content handling
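
The message cleaning described above might look something like this. It is a sketch under assumptions: `clean_messages` is a hypothetical name, and joining the text parts of a multipart content array is one plausible way to turn non-string content into a string.

```python
from typing import Any


def clean_messages(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return copies of messages whose content key exists and is a string."""
    cleaned = []
    for message in messages:
        msg = dict(message)  # shallow copy so the caller's dicts stay untouched
        content = msg.get("content")
        if content is None:
            # Covers both a missing key and an explicit None
            # (e.g. assistant messages that only carry tool_calls).
            msg["content"] = ""
        elif isinstance(content, list):
            # Assumption: multipart content uses {"type": "text", "text": ...} parts.
            msg["content"] = "".join(
                part.get("text", "")
                for part in content
                if isinstance(part, dict) and part.get("type") == "text"
            )
        elif not isinstance(content, str):
            msg["content"] = str(content)
        cleaned.append(msg)
    return cleaned
```

Passing the cleaned list to the chat template means the template never sees a message without a string content value.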

910238ba

Fix Jinja2 crash: ensure content key always exists in messages · f8618ce8

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Add explicit check for missing content key in message dictionaries
- Use more aggressive regex patterns in strip_tool_calls_from_content
- Handle tool call tags in various formats (JSON, XML, tool names)
- Add checks in format_messages, _manual_format_messages, and chat_completions endpoint
- Fixes: 'dict object' has no attribute 'content' error in Jinja2 templates

f8618ce8

Fix Jinja2 error: ensure no message has None content in VulkanBackend · 4296b440

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Added safety check in generate_chat_stream to replace None content with empty string
- Added same check in generate_chat for consistency
- This prevents the 'dict object' has no attribute 'content' error when
  processing messages with tool_calls that have no text content

4296b440

feat: Add multi-model support for audio transcription and image generation · 1cdfe825

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Add --audio-model and --image-model CLI arguments
- Add --loadall, --audio-ctx, --audio-offload, --vision-ctx, --vision-offload args
- Implement MultiModelManager class for dynamic model switching
- Add POST /v1/audio/transcriptions endpoint (OpenAI-compatible)
- Add POST /v1/images/generations endpoint (OpenAI-compatible)
- Update endpoints to use multi_model_manager for model selection
- Audio uses faster-whisper for local transcription
- Images use Stable Diffusion via diffusers

1cdfe825

Fix Jinja2 crash and tool call filtering · eb6b8d85

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Fix: Handle None content in messages to prevent the Jinja2 'dict object' has no attribute 'content' error
  - Added safety check in chat_completions function
  - Fixed _manual_format_messages to explicitly check for None
  - Fixed format_messages in VulkanBackend to ensure content is never None

- Fix: Always filter tool call format from output
  - Changed filter to run unconditionally (not just when tools are present)
  - Added extra regex patterns for JSON format tool calls like <tool>{...}</tool>

- Also fixed: Minor typos in comments (cket ->cket)

eb6b8d85

Fix tool parsing: deduplicate tool calls, strip raw format from streaming content · 886ea8f4

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Add seen_signatures set to extract_tool_calls() to prevent duplicates
- Add strip_tool_calls_from_content() method to remove <tool>...</tool> tags
- Filter tool format from each chunk in real-time during streaming
- Simplify post-stream tool call handling since content is already cleaned
- Also handle non-streaming responses for tool call content cleanup
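
A minimal sketch of the deduplication and stripping steps. The names `extract_tool_calls`, `strip_tool_calls_from_content`, and `seen_signatures` and the `<tool>...</tool>` tag come from the commit message; the JSON payload shape, the signature scheme, and the regex are assumptions.

```python
import json
import re

# Non-greedy match so several <tool> blocks in one reply are found separately.
TOOL_TAG = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Parse <tool>{...}</tool> blocks, dropping exact duplicates."""
    calls = []
    seen_signatures = set()
    for match in TOOL_TAG.finditer(text):
        try:
            call = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed payloads instead of failing the reply
        if not isinstance(call, dict):
            continue
        # Sorting keys makes the signature independent of key order.
        signature = json.dumps(call, sort_keys=True)
        if signature in seen_signatures:
            continue
        seen_signatures.add(signature)
        calls.append(call)
    return calls


def strip_tool_calls_from_content(text: str) -> str:
    """Remove raw <tool>...</tool> blocks from user-visible content."""
    return TOOL_TAG.sub("", text).strip()
```

This sketch does not cover a tag that is split across two streamed chunks; handling that case would need some buffering.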

886ea8f4

Add CUDA build option for llama-cpp-python · 821e40dd
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

821e40dd
Create separate venv for each backend: venv_nvidia, venv_vulkan, venv_vulkan_nvidia · 58f4382d
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

58f4382d
Add vulkan-nvidia build option · 0b0a9798
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

0b0a9798
Debug Vulkan single GPU mode and add GGML_VULKAN_DEVICE env var · 6413d14f
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

6413d14f

07 Mar, 2026 3 commits
- Detect chat template from model and use appropriate formatting - avoid Jinja errors by using manual formatting when template detection fails · 8d484ec2
  Stefy Lanza (nextime / spora ) authored Mar 07, 2026
  
  8d484ec2
- Fix Jinja2 template error - properly handle multipart content arrays and tool_calls format · 576a6cfe
  Stefy Lanza (nextime / spora ) authored Mar 07, 2026
  
  576a6cfe
- Fix Jinja2 template error in Vulkan backend - ensure all messages have content attribute · 08eee40c
  Stefy Lanza (nextime / spora ) authored Mar 07, 2026
  
  08eee40c
05 Mar, 2026 3 commits

Add fallback for models that don't support load_in_4bit quantization · e7e2c626

Stefy Lanza (nextime / spora ) authored Mar 05, 2026

Modify _try_load_model() to catch TypeError when quantization arguments
are not supported by the model class. When this happens, the method now:
1. Warns the user about unsupported quantization
2. Retries loading the model without quantization arguments
3. Returns the model successfully if loading works

This fixes issues with models like Qwen3.5 that don't support
bitsandbytes quantization.
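
The three steps above can be sketched as follows. The `loader` callable stands in for whatever model class is being loaded, and `try_load_model` is an illustrative name; neither reflects the actual signature of `_try_load_model()`.

```python
import warnings
from typing import Any, Callable


def try_load_model(loader: Callable[..., Any], model_id: str,
                   **quant_kwargs: Any) -> Any:
    """Load a model, retrying without quantization arguments on TypeError."""
    try:
        return loader(model_id, **quant_kwargs)
    except TypeError as exc:
        if not quant_kwargs:
            raise  # nothing to strip, so the TypeError is a real bug
        # 1. Warn the user about unsupported quantization.
        warnings.warn(
            f"{model_id}: quantization arguments not supported ({exc}); "
            "retrying without them"
        )
        # 2. Retry without the quantization arguments.
        # 3. Return the model if this second attempt succeeds.
        return loader(model_id)
```

One caveat of this approach is that a TypeError raised for an unrelated reason also triggers the retry, so the warning text should make the fallback visible.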

e7e2c626

Add OOM handling during generation to prevent crashes · 33a7e421

Stefy Lanza (nextime / spora ) authored Mar 05, 2026

- Wrap generate() with try-except to catch CUDA OOM errors
- On OOM: clear CUDA cache, retry with half the tokens, return a graceful error if still failing
- Wrap generate_stream() thread with error handling using shared variable
- Yield error messages to client instead of crashing the process
- Allows server to continue running after generation OOM
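
A sketch of the retry logic for the non-streaming path, written against a generic exception type so it runs without a GPU. In a PyTorch setting one would presumably pass `torch.cuda.OutOfMemoryError` as `oom_error` and `torch.cuda.empty_cache` as `clear_cache`; the function name and the error string are assumptions.

```python
from typing import Callable, Type


def generate_with_oom_retry(
    generate: Callable[[str, int], str],
    prompt: str,
    max_tokens: int,
    oom_error: Type[BaseException] = MemoryError,
    clear_cache: Callable[[], None] = lambda: None,
) -> str:
    """Run generate(); on OOM, free cached memory and retry with half the tokens."""
    try:
        return generate(prompt, max_tokens)
    except oom_error:
        clear_cache()
    try:
        # Second attempt with half the budget, but never fewer than one token.
        return generate(prompt, max(1, max_tokens // 2))
    except oom_error:
        clear_cache()
        # Return an error message to the client instead of crashing the server.
        return "[error] out of GPU memory; try a shorter prompt or fewer tokens"
```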

33a7e421

Add --max-gpu-percent parameter for fine-grained GPU memory control · d62bdffb

Stefy Lanza (nextime / spora ) authored Mar 05, 2026

This new parameter allows users to specify the exact percentage of GPU VRAM
to use, overriding the offload-strategy. When specified, the model will:
1. Use up to max-gpu-percent of VRAM
2. Offload remaining weights to CPU RAM (--ram)
3. Overflow to disk (--offload-dir) if RAM exhausted
4. Automatically fall back in 5% steps if OOM occurs

Example usage for RTX 3090 with Qwen3.5-35B-A3B:
  coderai --model Qwen/Qwen3.5-35B-A3B --max-gpu-percent 50 --ram 64

This ensures MoE models with high VRAM requirements during generation
can run without OOM by using CPU RAM as the primary offload target.
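
The 5% fallback in step 4 could be sketched as below. The helper names, the 5% floor, and the use of a generic `oom_error` type are assumptions; `load` stands in for the real model-loading call.

```python
from typing import Any, Callable, Iterator, Tuple, Type


def gpu_percent_steps(max_gpu_percent: int, step: int = 5,
                      floor: int = 5) -> Iterator[int]:
    """Yield max_gpu_percent, then successively lower values in `step` decrements."""
    percent = max_gpu_percent
    while percent >= floor:
        yield percent
        percent -= step


def load_with_fallback(
    load: Callable[[int], Any],
    max_gpu_percent: int,
    oom_error: Type[BaseException] = MemoryError,
) -> Tuple[Any, int]:
    """Try loading at each percentage until one fits; return (model, percent)."""
    for percent in gpu_percent_steps(max_gpu_percent):
        try:
            return load(percent), percent
        except oom_error:
            continue  # too much VRAM requested, try the next lower step
    raise RuntimeError("model does not fit even at the lowest GPU percentage")
```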

d62bdffb

01 Mar, 2026 8 commits
- Add sequential offload strategy with fine-grained 2% VRAM incremental steps · e23c3f7f
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  e23c3f7f
- Add --offload-strategy parameter for NVIDIA backend with 4 strategy options · d9a5d274
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  d9a5d274
- Disable bitsandbytes quantization for Qwen3.5-A3B/MoE models which don't support it · 10d10573
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  10d10573
- Add 'a3b' to MoE model indicators to recognize Qwen3.5-A3B architecture · 8665016a
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  8665016a
- Add MoE model detection with 80% VRAM limit for generation headroom · 4041187d
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  4041187d
- Add GPU size-aware VRAM limits: 99% for <3GB, 96% for 3-8GB, 93% for >8GB · 13bb1675
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  13bb1675
- Add automatic OOM handling with progressive VRAM reduction fallback for NVIDIA backend · b30c4c04
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  b30c4c04
- Change NVIDIA backend VRAM limit from 99.9% to 93% to leave more headroom for CUDA overhead · 320ca0e7
  Stefy Lanza (nextime / spora ) authored Mar 01, 2026
  
  320ca0e7
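
The VRAM limits named in the 01 Mar commits can be summarised in a small helper. This is a sketch, not the project's code: the thresholds (99%, 96%, 93%, and 80% for MoE) and the 'a3b' indicator come from the commit titles, while the function name, the "moe" indicator, the handling of exactly 3 GB and 8 GB, and the rule that the MoE limit takes precedence are assumptions.

```python
# "a3b" is named in the commit log; "moe" is an assumed additional indicator.
MOE_INDICATORS = ("moe", "a3b")


def vram_fraction(total_vram_gb: float, model_name: str = "") -> float:
    """Pick the fraction of VRAM to hand to the model."""
    name = model_name.lower()
    if any(tag in name for tag in MOE_INDICATORS):
        # MoE models get extra headroom for generation.
        return 0.80
    if total_vram_gb < 3:
        return 0.99
    if total_vram_gb <= 8:
        return 0.96
    return 0.93
```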