Commits · c8f70fe410a11c5e04e10e2bb3133552b0a1030e · nexlab / coderai

09 Mar, 2026 8 commits

Fix: Skip faster-whisper for GGUF files · c8f70fe4

Your Name authored Mar 09, 2026

faster-whisper cannot load GGUF files (GGUF is the llama.cpp model format).
GGUF files are now detected by extension and routed directly to whispercpp.
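
A minimal sketch of the extension check described above. The function name and the returned backend labels are hypothetical and not taken from the repository.

```python
from pathlib import Path


def pick_transcription_backend(model_path: str) -> str:
    """Return the name of the backend to try first for an audio model path."""
    # GGUF files cannot be loaded by faster-whisper, so skip it entirely.
    if Path(model_path).suffix.lower() == ".gguf":
        return "whispercpp"
    return "faster-whisper"
```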

c8f70fe4

Fix: Fall back to whispercpp when faster-whisper fails to load · 11a0fd46

Your Name authored Mar 09, 2026

- Add faster_whisper_failed flag to properly track failures
- When faster-whisper raises an exception other than ImportError (e.g., GGUF
  not supported), fall back to whispercpp instead of failing
- Applies to both pre-loading and transcription endpoint
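
The fallback could look roughly like the sketch below. Only the `faster_whisper_failed` flag name comes from the commit message. The function and its two loader callables are hypothetical stand-ins for the real backend loaders.

```python
def load_audio_model(model_id, load_faster_whisper, load_whispercpp):
    """Try faster-whisper first; fall back to whispercpp on any load failure."""
    faster_whisper_failed = False
    model = None
    try:
        model = load_faster_whisper(model_id)
    except ImportError:
        # faster-whisper (or one of its dependencies) is not installed.
        faster_whisper_failed = True
    except Exception as exc:
        # Installed, but the model could not be loaded (e.g. unsupported format).
        print(f"faster-whisper failed to load {model_id!r}: {exc}")
        faster_whisper_failed = True

    if faster_whisper_failed:
        return "whispercpp", load_whispercpp(model_id)
    return "faster-whisper", model
```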

11a0fd46

Fix error handling for audio transcription when libraries unavailable · fee8a9dd

Your Name authored Mar 09, 2026

- Add specific detection for 'invalid ELF' / 'Mach-O' architecture mismatch errors
- Improve error messages to mention both options:
  - Install PyTorch + faster-whisper
  - Use built-in whispercpp model (tiny/base/small/medium/large)
- Fix critical bug: now raises HTTPException instead of returning None
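
As an illustration of the detection above, the sketch below classifies a load error by its message text. The marker strings come from the commit message; the function name and the message wording are assumptions. In the endpoint, the resulting text would presumably be passed to an HTTPException rather than the handler returning None.

```python
# Substrings that suggest a native library built for another architecture.
ARCH_MISMATCH_MARKERS = ("invalid elf", "mach-o")

INSTALL_HINT = (
    "Install PyTorch + faster-whisper, or use a built-in whispercpp "
    "model (tiny/base/small/medium/large)."
)


def describe_load_error(exc: Exception) -> str:
    """Build a user-facing message for a failed audio backend load."""
    text = str(exc).lower()
    if any(marker in text for marker in ARCH_MISMATCH_MARKERS):
        return f"Native library was built for a different architecture. {INSTALL_HINT}"
    return f"Audio transcription is unavailable: {exc}. {INSTALL_HINT}"
```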

fee8a9dd

Fix pre-loading to recognize built-in whispercpp model names · 2186b190

Your Name authored Mar 09, 2026

- Recognize built-in model names: tiny, base, small, medium, large-v1, large
- Allow pre-loading these models directly without file path
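
A minimal sketch of the name check, using the model names listed above. The constant and function names are hypothetical.

```python
# Built-in model names taken from the commit message.
BUILTIN_WHISPERCPP_MODELS = {"tiny", "base", "small", "medium", "large-v1", "large"}


def is_builtin_whisper_model(name: str) -> bool:
    """True if `name` is a built-in model name rather than a file path."""
    return name.strip().lower() in BUILTIN_WHISPERCPP_MODELS
```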

2186b190

Improve whispercpp error handling for HuggingFace GGUF files · f5142c1b

Your Name authored Mar 09, 2026

- Add better error detection for 'not a valid preconverted model' errors
- Provide clear guidance to users about whispercpp limitations
- Suggest installing faster-whisper with PyTorch or using built-in model names
- Update both transcription endpoint and pre-loading code

f5142c1b

Add whispercpp support for audio transcription without PyTorch · 44941ac6

Your Name authored Mar 09, 2026

- Update transcription endpoint to try faster-whisper first, then whispercpp
- Update pre-loading code to support both backends
- Add whispercpp to all requirements files (vulkan, nvidia, default)
- Remove broken llama.cpp fallback (llama.cpp cannot transcribe Whisper)

44941ac6

Add faster-whisper to requirements for audio transcription · 6ef7a2dd
Your Name authored Mar 09, 2026

6ef7a2dd
Add test files to .gitignore · 606747de
Your Name authored Mar 09, 2026

606747de

08 Mar, 2026 27 commits

Suppress unraisable LlamaModel.__del__ errors using sys.unraisablehook · f28c6185
Stefy Lanza (nextime / spora ) authored Mar 08, 2026
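
One way to use `sys.unraisablehook` for this, sketched under the assumption that the failing destructor can be recognised by its qualified name. `make_hook` is a hypothetical helper, not code from the repository.

```python
import sys


def make_hook(fallback):
    """Wrap `fallback` so errors from LlamaModel.__del__ are silently dropped."""
    def hook(unraisable):
        # For a failing __del__, `unraisable.object` is the __del__ function.
        qualname = getattr(unraisable.object, "__qualname__", "")
        if qualname.endswith("LlamaModel.__del__"):
            return
        fallback(unraisable)
    return hook


# Install the filter, delegating everything else to the previous hook.
sys.unraisablehook = make_hook(sys.unraisablehook)
```

Unlike a bare `except` around the call site, this also covers destructors that run later, during garbage collection or interpreter shutdown.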

f28c6185
Use bare except to suppress llama.cpp __del__ errors · 6bd4dc91
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

6bd4dc91
Suppress llama.cpp __del__ errors during pre-load · f9739fe3
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

f9739fe3
Remove traceback print for optional audio pre-load · ba8e4792
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

ba8e4792
Add clearer message when audio model loads on-demand · e554baef
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

e554baef
Try faster-whisper first for audio pre-load, fall back to GGUF · bae50d66
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

bae50d66
Use download_model helper for audio pre-load with progress · 4f6d64d4
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

4f6d64d4
Add download_model helper with progress: size, total, speed · b622fe9e
Stefy Lanza (nextime / spora ) authored Mar 08, 2026
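
A sketch of how such a progress line might be formatted. The function name, units, and output layout are assumptions; only the three reported quantities (size, total, speed) come from the commit title.

```python
def format_progress(downloaded: int, total: int, elapsed_s: float) -> str:
    """Format downloaded size, total size, and average speed for display."""
    mib = 1024 * 1024
    speed = downloaded / elapsed_s / mib if elapsed_s > 0 else 0.0
    if total > 0:
        pct = 100.0 * downloaded / total
        return (
            f"{downloaded / mib:.1f}/{total / mib:.1f} MiB "
            f"({pct:.0f}%) at {speed:.1f} MiB/s"
        )
    # Total size unknown (e.g. no Content-Length header).
    return f"{downloaded / mib:.1f} MiB at {speed:.1f} MiB/s"
```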

b622fe9e
Add better error handling for GGUF audio model loading · 23fe4347
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

23fe4347

Add GGUF audio model support with llama.cpp (Vulkan) · 3daca858

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

When the audio model is in GGUF format, use llama.cpp instead of faster-whisper
for pre-loading. This allows the Vulkan backend to be used for audio transcription.

3daca858

Auto-pre-load single model when only one model type is configured · 833a4ff3

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

When only one model type is specified (e.g., only --audio-model with no
--model), automatically pre-load it even in on-demand mode. This ensures
the model is downloaded and ready for use.

833a4ff3

Add model pre-loading support (--loadall, --loadswap) and fix duplicate code bug · 6310e8b1

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Add --loadall flag to pre-load all models at startup
- Add --loadswap flag to keep models in RAM, swap active to VRAM
- Fix bug where load_mode was used before being defined in audio model section
- Remove duplicate load_mode determination code
- Improve the "no main model specified" error message to mention TTS
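
A sketch of the two flags with the load mode resolved once, before any model section reads it. The flag names come from the commit; the mode strings and the precedence of `--loadswap` over `--loadall` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--loadall", action="store_true",
                        help="pre-load all models at startup")
    parser.add_argument("--loadswap", action="store_true",
                        help="keep models in RAM, swap the active one to VRAM")
    return parser


def resolve_load_mode(args: argparse.Namespace) -> str:
    """Determine the load mode in one place so later code can rely on it."""
    if args.loadswap:  # assumed to take precedence over --loadall
        return "loadswap"
    if args.loadall:
        return "loadall"
    return "ondemand"
```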

6310e8b1

Add audio model pre-loading at startup when --loadall is used · 7651468e
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

7651468e

Add TTS support with kokoro-python and model caching improvements · ebd4acbb

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Add --tts-model option for Kokoro TTS models
- Add /v1/audio/speech endpoint (OpenAI-compatible)
- Add model caching to prevent redundant downloads
- Replace MD5 with SHA-256 for cache keys
- Move hashlib and pathlib imports to module level
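
The cache key change might look like the following sketch. The function name and the choice of hashing the source URL are assumptions; the commit only states that SHA-256 replaces MD5.

```python
import hashlib
from pathlib import Path


def cache_path_for(source: str, cache_dir: Path) -> Path:
    """Map a model source (URL or identifier) to a stable cache file path."""
    digest = hashlib.sha256(source.encode("utf-8")).hexdigest()
    return cache_dir / digest
```

A download would then be skipped whenever `cache_path_for(source, cache_dir).exists()` is true.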

ebd4acbb

Make --model optional when --audio-model or --image-model are specified · 10dc9f5c

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- --model is now optional if using audio or image models only
- Shows helpful error message with examples if no model specified
- Prints available models at startup
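
A sketch of the startup check, assuming the parsed option values are passed in directly. The function name and the error text are hypothetical.

```python
def require_some_model(model=None, audio_model=None, image_model=None) -> dict:
    """Return the configured models, or raise if none was given."""
    candidates = (
        ("model", model),
        ("audio-model", audio_model),
        ("image-model", image_model),
    )
    configured = {name: value for name, value in candidates if value}
    if not configured:
        raise ValueError(
            "No model specified. Pass at least one of "
            "--model, --audio-model, or --image-model."
        )
    return configured
```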

10dc9f5c

Support full URLs for model paths · 3ae1869a

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Accept full HTTPS URLs for --model (Vulkan/GGUF models)
- Accept full HTTPS URLs for --audio-model (faster-whisper models)
- Downloads file to temp directory before loading
- Shows download progress percentage
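
Telling a full URL apart from a local path or repository identifier can be sketched as below. The function name is hypothetical, and only HTTPS is accepted, as in the commit message.

```python
from urllib.parse import urlparse


def is_remote_model(path: str) -> bool:
    """True if `path` is a full HTTPS URL rather than a local path or repo id."""
    parsed = urlparse(path)
    return parsed.scheme == "https" and bool(parsed.netloc)
```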

3ae1869a

Add --debug flag to dump full requests and replies · c12c55d6

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Add --debug CLI argument to enable debug mode
- When enabled, dumps full request body (no truncation)
- When enabled, dumps full generated text (no truncation)
- When enabled, dumps extracted tool calls in JSON format
- Useful for troubleshooting tool call issues

c12c55d6

Fix Pydantic deprecation warnings and Jinja2 crash · 910238ba

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Replace class-based Config with model_config = ConfigDict() in all Pydantic models
- Fix Jinja2 crash by ensuring all messages have content key that is never None
- Enhanced message cleaning in generate_chat and generate_chat_stream to create copies and ensure content is always a string
- Add final safety check in chat_completions endpoint for content handling
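
The message cleaning could be sketched as follows. The function name is hypothetical, and joining the text parts of a multipart content list is an assumption about how such messages should be flattened.

```python
def normalize_messages(messages):
    """Return copies of chat messages whose "content" is always a string."""
    cleaned = []
    for message in messages:
        message = dict(message)  # shallow copy; the caller's dict is untouched
        content = message.get("content")
        if content is None:
            # Covers a missing key and an explicit None (e.g. tool_calls only).
            message["content"] = ""
        elif isinstance(content, list):
            # Multipart content: keep only the text parts.
            message["content"] = "".join(
                part.get("text", "")
                for part in content
                if isinstance(part, dict) and part.get("type") == "text"
            )
        elif not isinstance(content, str):
            message["content"] = str(content)
        cleaned.append(message)
    return cleaned
```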

910238ba

Fix Jinja2 crash: ensure content key always exists in messages · f8618ce8

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Add explicit check for missing content key in message dictionaries
- Use more aggressive regex patterns in strip_tool_calls_from_content
- Handle tool call tags in various formats (JSON, XML, tool names)
- Add checks in format_messages, _manual_format_messages, and chat_completions endpoint
- Fixes: 'dict object' has no attribute 'content' error in Jinja2 templates

f8618ce8

Fix Jinja2 error: ensure no message has None content in VulkanBackend · 4296b440

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Added safety check in generate_chat_stream to replace None content with empty string
- Added same check in generate_chat for consistency
- This prevents 'dict object has no attribute content' error when
  processing messages with tool_calls that have no text content

4296b440

feat: Add multi-model support for audio transcription and image generation · 1cdfe825

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Add --audio-model and --image-model CLI arguments
- Add --loadall, --audio-ctx, --audio-offload, --vision-ctx, --vision-offload args
- Implement MultiModelManager class for dynamic model switching
- Add POST /v1/audio/transcriptions endpoint (OpenAI-compatible)
- Add POST /v1/images/generations endpoint (OpenAI-compatible)
- Update endpoints to use multi_model_manager for model selection
- Audio uses faster-whisper for local transcription
- Images use Stable Diffusion via diffusers

1cdfe825

Fix Jinja2 crash and tool call filtering · eb6b8d85

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Fix: Handle None content in messages to prevent Jinja2 'dict object has no attribute content' error
  - Added safety check in chat_completions function
  - Fixed _manual_format_messages to explicitly check for None
  - Fixed format_messages in VulkanBackend to ensure content is never None

- Fix: Always filter tool call format from output
  - Changed filter to run unconditionally (not just when tools are present)
  - Added extra regex patterns for JSON format tool calls like <tool>{...}</tool>

- Also fixed: Minor typos in comments (cket ->cket)

eb6b8d85

Fix tool parsing: deduplicate tool calls, strip raw format from streaming content · 886ea8f4

Stefy Lanza (nextime / spora ) authored Mar 08, 2026

- Add seen_signatures set to extract_tool_calls() to prevent duplicates
- Add strip_tool_calls_from_content() method to remove <tool>...</tool> tags
- Filter tool format from each chunk in real-time during streaming
- Simplify post-stream tool call handling since content is already cleaned
- Also handle non-streaming responses for tool call content cleanup
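
A sketch of the deduplication and stripping steps. The function names and `seen_signatures` come from the commit message, but the bodies are assumptions: the real code may use different patterns and signatures, and here a call's signature is simply its canonical JSON form.

```python
import json
import re

TOOL_TAG_RE = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)


def extract_tool_calls(text: str) -> list:
    """Parse <tool>{...}</tool> blocks, dropping exact duplicates."""
    seen_signatures = set()
    calls = []
    for raw in TOOL_TAG_RE.findall(text):
        try:
            call = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not valid JSON; ignore this block
        # Sorting keys makes the signature independent of key order.
        signature = json.dumps(call, sort_keys=True)
        if signature in seen_signatures:
            continue
        seen_signatures.add(signature)
        calls.append(call)
    return calls


def strip_tool_calls_from_content(text: str) -> str:
    """Remove <tool>...</tool> blocks from user-visible content."""
    return TOOL_TAG_RE.sub("", text).strip()
```

When filtering streamed chunks, a tag can be split across two chunks, so a per-chunk filter would need to buffer partial tags before applying a pattern like this.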

886ea8f4

Add CUDA build option for llama-cpp-python · 821e40dd
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

821e40dd
Create separate venv for each backend: venv_nvidia, venv_vulkan, venv_vulkan_nvidia · 58f4382d
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

58f4382d
Add vulkan-nvidia build option · 0b0a9798
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

0b0a9798
Debug Vulkan single GPU mode and add GGML_VULKAN_DEVICE env var · 6413d14f
Stefy Lanza (nextime / spora ) authored Mar 08, 2026

6413d14f

07 Mar, 2026 3 commits

Detect chat template from model and use appropriate formatting - avoid Jinja errors by using manual formatting when template detection fails · 8d484ec2
Stefy Lanza (nextime / spora ) authored Mar 07, 2026

8d484ec2
Fix Jinja2 template error - properly handle multipart content arrays and tool_calls format · 576a6cfe
Stefy Lanza (nextime / spora ) authored Mar 07, 2026

576a6cfe
Fix Jinja2 template error in Vulkan backend - ensure all messages have content attribute · 08eee40c
Stefy Lanza (nextime / spora ) authored Mar 07, 2026

08eee40c

05 Mar, 2026 2 commits

Add fallback for models that don't support load_in_4bit quantization · e7e2c626

Stefy Lanza (nextime / spora ) authored Mar 05, 2026

Modify _try_load_model() to catch TypeError when quantization arguments
are not supported by the model class. When this happens, the method now:
1. Warns the user about unsupported quantization
2. Retries loading the model without quantization arguments
3. Returns the model successfully if loading works

This fixes issues with models like Qwen3.5 that don't support
bitsandbytes quantization.
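
The retry described above can be sketched as follows. This is not the repository's `_try_load_model()`: the `loader` callable is a hypothetical stand-in for the model class loader, and treating `load_in_8bit` and `quantization_config` as quantization arguments alongside `load_in_4bit` is an assumption.

```python
import warnings

# Keyword arguments treated as quantization options (partly assumed).
QUANT_KEYS = {"load_in_4bit", "load_in_8bit", "quantization_config"}


def try_load_model(loader, model_id, **kwargs):
    """Load a model, retrying without quantization args on TypeError."""
    try:
        return loader(model_id, **kwargs)
    except TypeError as exc:
        if not QUANT_KEYS & kwargs.keys():
            raise  # nothing to strip; the TypeError has another cause
        warnings.warn(
            f"{model_id}: quantization not supported ({exc}); "
            "retrying without it"
        )
        plain = {k: v for k, v in kwargs.items() if k not in QUANT_KEYS}
        return loader(model_id, **plain)
```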

e7e2c626

Add OOM handling during generation to prevent crashes · 33a7e421

Stefy Lanza (nextime / spora ) authored Mar 05, 2026

- Wrap generate() with try-except to catch CUDA OOM errors
- On OOM: clear CUDA cache, retry with half tokens, return graceful error if still failing
- Wrap generate_stream() thread with error handling using shared variable
- Yield error messages to client instead of crashing the process
- Allows server to continue running after generation OOM
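
A framework-free sketch of the non-streaming retry. The function and parameter names are hypothetical. With PyTorch, `oom_error` would presumably be `torch.cuda.OutOfMemoryError` and `clear_cache` would be `torch.cuda.empty_cache`; `MemoryError` is used here only so the example runs without a GPU.

```python
def generate_with_oom_retry(generate, prompt, max_tokens,
                            clear_cache=lambda: None, oom_error=MemoryError):
    """Run `generate`, retrying once with half the tokens after an OOM."""
    try:
        return generate(prompt, max_tokens)
    except oom_error:
        clear_cache()  # free cached memory before retrying
        try:
            return generate(prompt, max(1, max_tokens // 2))
        except oom_error:
            # Return an error message instead of crashing the server.
            return ("Error: out of memory during generation; "
                    "try a shorter prompt or fewer tokens.")
```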

33a7e421