- 08 Mar, 2026 26 commits
-
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
When audio model is in GGUF format, use llama.cpp instead of faster-whisper for pre-loading. This allows using Vulkan backend for audio transcription.
-
Stefy Lanza (nextime / spora ) authored
When only one model type is specified (e.g., only --audio-model with no --model), automatically pre-load it even in on-demand mode. This ensures the model is downloaded and ready for use.
-
Stefy Lanza (nextime / spora ) authored
- Add --loadall flag to pre-load all models at startup
- Add --loadswap flag to keep models in RAM, swap active to VRAM
- Fix bug where load_mode was used before being defined in audio model section
- Remove duplicate load_mode determination code
- Improve error message for no main model specified to include TTS
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
- Add --tts-model option for Kokoro TTS models
- Add /v1/audio/speech endpoint (OpenAI-compatible)
- Add model caching to prevent redundant downloads
- Replace MD5 with SHA-256 for cache keys
- Move hashlib and pathlib imports to module level
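The caching change above can be illustrated with a minimal sketch. It assumes the cache key is derived from the model URL and that cached files sit in a single directory; `cache_path_for` and `is_cached` are hypothetical names, not functions taken from the project.

```python
import hashlib
from pathlib import Path


def cache_path_for(url: str, cache_dir: Path) -> Path:
    """Map a model URL to a deterministic file path inside cache_dir.

    Hypothetical helper: the key is the SHA-256 hex digest of the URL,
    so the same URL always resolves to the same cached file.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    # Keep the file extension (for example .gguf) if the URL has one.
    suffix = Path(url.split("?", 1)[0]).suffix
    return cache_dir / f"{digest}{suffix}"


def is_cached(url: str, cache_dir: Path) -> bool:
    """Return True when a previous download for this URL is already on disk."""
    return cache_path_for(url, cache_dir).is_file()
```

A caller would check `is_cached(url, cache_dir)` before downloading and write the download to `cache_path_for(url, cache_dir)` otherwise.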
-
Stefy Lanza (nextime / spora ) authored
- --model is now optional if using audio or image models only
- Shows helpful error message with examples if no model specified
- Prints available models at startup
-
Stefy Lanza (nextime / spora ) authored
- Accept full HTTPS URLs for --model (Vulkan/GGUF models)
- Accept full HTTPS URLs for --audio-model (faster-whisper models)
- Downloads file to temp directory before loading
- Shows download progress percentage
-
Stefy Lanza (nextime / spora ) authored
- Add --debug CLI argument to enable debug mode
- When enabled, dumps full request body (no truncation)
- When enabled, dumps full generated text (no truncation)
- When enabled, dumps extracted tool calls in JSON format
- Useful for troubleshooting tool call issues
-
Stefy Lanza (nextime / spora ) authored
- Replace class-based Config with model_config = ConfigDict() in all Pydantic models
- Fix Jinja2 crash by ensuring all messages have content key that is never None
- Enhanced message cleaning in generate_chat and generate_chat_stream to create copies and ensure content is always a string
- Add final safety check in chat_completions endpoint for content handling
-
Stefy Lanza (nextime / spora ) authored
- Add explicit check for missing content key in message dictionaries
- Use more aggressive regex patterns in strip_tool_calls_from_content
- Handle tool call tags in various formats (JSON, XML, tool names)
- Add checks in format_messages, _manual_format_messages, and chat_completions endpoint
- Fixes: 'dict object' has no attribute 'content' error in Jinja2 templates
-
Stefy Lanza (nextime / spora ) authored
- Added safety check in generate_chat_stream to replace None content with empty string
- Added same check in generate_chat for consistency
- This prevents 'dict object has no attribute content' error when processing messages with tool_calls that have no text content
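The safety check described above could look roughly like the following sketch. It assumes messages are plain dictionaries in the OpenAI chat format; `clean_messages` is a hypothetical helper name, and the project's actual code may be structured differently.

```python
import copy


def clean_messages(messages):
    """Return copies of the messages whose 'content' is never missing or None.

    Hypothetical helper: assistant messages that carry only tool_calls often
    have content=None, which a Jinja2 chat template cannot render.
    """
    cleaned = []
    for message in messages:
        msg = copy.deepcopy(message)  # never mutate the caller's request data
        if msg.get("content") is None:
            msg["content"] = ""
        cleaned.append(msg)
    return cleaned
```

Working on copies keeps the original request intact, which is useful when the raw request body is also logged for debugging.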
-
Stefy Lanza (nextime / spora ) authored
- Add --audio-model and --image-model CLI arguments
- Add --loadall, --audio-ctx, --audio-offload, --vision-ctx, --vision-offload args
- Implement MultiModelManager class for dynamic model switching
- Add POST /v1/audio/transcriptions endpoint (OpenAI-compatible)
- Add POST /v1/images/generations endpoint (OpenAI-compatible)
- Update endpoints to use multi_model_manager for model selection
- Audio uses faster-whisper for local transcription
- Images use Stable Diffusion via diffusers
-
Stefy Lanza (nextime / spora ) authored
- Fix: Handle None content in messages to prevent Jinja2 'dict object has no attribute content' error
  - Added safety check in chat_completions function
  - Fixed _manual_format_messages to explicitly check for None
  - Fixed format_messages in VulkanBackend to ensure content is never None
- Fix: Always filter tool call format from output
  - Changed filter to run unconditionally (not just when tools are present)
  - Added extra regex patterns for JSON format tool calls like <tool>{...}</tool>
- Also fixed: Minor typos in comments (cket ->cket)
-
Stefy Lanza (nextime / spora ) authored
- Add seen_signatures set to extract_tool_calls() to prevent duplicates
- Add strip_tool_calls_from_content() method to remove <tool>...</tool> tags
- Filter tool format from each chunk in real-time during streaming
- Simplify post-stream tool call handling since content is already cleaned
- Also handle non-streaming responses for tool call content cleanup
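A rough sketch of the deduplication and tag-stripping idea follows. The names extract_tool_calls, strip_tool_calls_from_content and seen_signatures come from the commit message, but the bodies are assumptions: the sketch treats them as plain functions, assumes each `<tool>` tag wraps a JSON object, and uses a canonical JSON dump as the signature.

```python
import json
import re

# Assumed tag format: <tool>{...json...}</tool>
TOOL_TAG_RE = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)


def extract_tool_calls(text):
    """Collect unique tool calls from generated text."""
    calls = []
    seen_signatures = set()
    for match in TOOL_TAG_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed payloads instead of failing the request
        if not isinstance(payload, dict):
            continue
        # Sorting keys makes two calls with the same content compare equal.
        signature = json.dumps(payload, sort_keys=True)
        if signature in seen_signatures:
            continue
        seen_signatures.add(signature)
        calls.append(payload)
    return calls


def strip_tool_calls_from_content(text):
    """Remove <tool>...</tool> blocks so only user-facing text remains."""
    return TOOL_TAG_RE.sub("", text).strip()
```

In a streaming setting, a tag may be split across chunks, so a real implementation would likely need to buffer partial tags before filtering; this sketch only handles complete text.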
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
- 07 Mar, 2026 3 commits
-
-
Stefy Lanza (nextime / spora ) authored
Detect chat template from model and use appropriate formatting - avoid Jinja errors by using manual formatting when template detection fails
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
- 05 Mar, 2026 3 commits
-
-
Stefy Lanza (nextime / spora ) authored
Modify _try_load_model() to catch TypeError when quantization arguments are not supported by the model class. When this happens, the method now:
1. Warns the user about unsupported quantization
2. Retries loading the model without quantization arguments
3. Returns the model successfully if loading works
This fixes issues with models like Qwen3.5 that don't support bitsandbytes quantization.
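The retry behaviour described above can be sketched as follows. This is an illustration under assumptions, not the project's `_try_load_model()`: `try_load_model` is a hypothetical function, and `loader` stands in for whatever callable actually constructs the model.

```python
import warnings


def try_load_model(loader, model_name, **quant_kwargs):
    """Load a model, retrying without quantization arguments on TypeError.

    `loader` is any callable taking (model_name, **kwargs); it is a stand-in
    for the real model class constructor or factory.
    """
    try:
        return loader(model_name, **quant_kwargs)
    except TypeError as exc:
        if not quant_kwargs:
            raise  # nothing to strip, so the error has another cause
        warnings.warn(
            f"Quantization arguments not supported for {model_name} ({exc}); "
            "retrying without quantization."
        )
        return loader(model_name)
```

One caveat of this approach: a TypeError raised for an unrelated reason is also retried once, so the warning text should make clear what was dropped.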
-
Stefy Lanza (nextime / spora ) authored
- Wrap generate() with try-except to catch CUDA OOM errors
- On OOM: clear CUDA cache, retry with half tokens, return graceful error if still failing
- Wrap generate_stream() thread with error handling using shared variable
- Yield error messages to client instead of crashing the process
- Allows server to continue running after generation OOM
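The OOM handling above can be sketched without a GPU by injecting the error type and the cache-clearing function. Everything here is hypothetical naming: `OutOfMemory` is a stand-in exception, and with PyTorch one would presumably pass `torch.cuda.OutOfMemoryError` and `torch.cuda.empty_cache` instead.

```python
class OutOfMemory(RuntimeError):
    """Stand-in for a CUDA out-of-memory error (assumption for this sketch)."""


def generate_with_oom_retry(generate_fn, prompt, max_tokens,
                            clear_cache=lambda: None, oom_error=OutOfMemory):
    """Call generate_fn; on OOM clear the cache and retry with half the tokens.

    Returns an error string instead of raising if the retry also runs out of
    memory, so the server process keeps running.
    """
    try:
        return generate_fn(prompt, max_tokens)
    except oom_error:
        clear_cache()
        reduced = max(1, max_tokens // 2)
        try:
            return generate_fn(prompt, reduced)
        except oom_error:
            clear_cache()
            return "[error] Out of GPU memory; try a shorter prompt or fewer tokens."
```

Returning a message rather than re-raising mirrors the goal stated in the commit: the client sees an error, and the server continues to accept requests.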
-
Stefy Lanza (nextime / spora ) authored
The new --max-gpu-percent parameter lets users specify the exact percentage of GPU VRAM to use, overriding the offload-strategy. When specified, the model will:
1. Use up to max-gpu-percent of VRAM
2. Offload remaining weights to CPU RAM (--ram)
3. Overflow to disk (--offload-dir) if RAM is exhausted
4. Automatically fall back in 5% steps if OOM occurs
Example usage for RTX 3090 with Qwen3.5-35B-A3B:
    coderai --model Qwen/Qwen3.5-35B-A3B --max-gpu-percent 50 --ram 64
This ensures that MoE models with high VRAM requirements during generation can run without OOM by using CPU RAM as the primary offload target.
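The 5% fallback in step 4 can be pictured with the sketch below. It is an assumption-laden illustration: `load_with_gpu_fallback` and `load_fn` are hypothetical, `MemoryError` stands in for the real OOM exception, and the sketch stops before reaching 0% because the commit message does not say what happens there.

```python
def load_with_gpu_fallback(load_fn, max_gpu_percent, step=5, oom_error=MemoryError):
    """Try loading at max_gpu_percent, lowering the budget by `step` on each OOM.

    `load_fn(percent)` is a stand-in for the real loader. Returns a tuple of
    (model, percent_actually_used).
    """
    percent = max_gpu_percent
    while percent > 0:
        try:
            return load_fn(percent), percent
        except oom_error:
            percent -= step  # reduce the VRAM budget and try again
    raise oom_error("could not load the model at any GPU percentage")
```

With `--max-gpu-percent 50`, such a loop would attempt 50, 45, 40 and so on until a load succeeds.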
-
- 01 Mar, 2026 8 commits
-
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-