- 16 Mar, 2026 29 commits
-
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
- Handle both dict and pydantic model formats for tools
- Add try/except around tool conversion and extraction
- More robust error handling to prevent 500 errors
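A minimal sketch of the dict-or-model handling described above. The helper names `tool_to_dict` and `convert_tools` are hypothetical and not taken from the commit; the sketch duck-types on `model_dump()` (pydantic v2) and `dict()` (pydantic v1) so it does not need pydantic installed.

```python
from __future__ import annotations

from typing import Any


def tool_to_dict(tool: Any) -> dict:
    """Return a plain dict for a tool given as a dict or a pydantic-style model."""
    if isinstance(tool, dict):
        return tool
    # pydantic v2 models expose model_dump(); v1 models expose dict().
    for method_name in ("model_dump", "dict"):
        method = getattr(tool, method_name, None)
        if callable(method):
            return method()
    raise TypeError(f"unsupported tool type: {type(tool).__name__}")


def convert_tools(tools: list | None) -> list[dict]:
    """Convert every tool, skipping entries that fail instead of raising."""
    converted = []
    for tool in tools or []:
        try:
            converted.append(tool_to_dict(tool))
        except Exception as exc:  # assumption: log and continue rather than return a 500
            print(f"skipping malformed tool: {exc}")
    return converted
```

Skipping a bad entry is one possible policy; returning a 4xx error to the client would be an equally reasonable choice.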
-
Your Name authored
- Move model_parser.py into codai/ directory
- Add __init__.py to make it a proper Python module
- Create ModelParserAdapter class to wrap ModelParserDispatcher
- Replace ToolCallParser() with ModelParserAdapter() in 4 locations
- Update import to use 'from codai import ModelParserDispatcher'

This enables model-specific tool call parsing for Qwen, DeepSeek, Llama, Mistral, Claude, Command R, Gemma, Grok, and Phi models.
-
Your Name authored
- Add Qwen-specific tool call parsing in ToolCallParser
- Support for Instruct-style: <tool_call>{JSON}</tool_call>
- Support for Coder-style: <tool_call><function=name><parameter=k>v</parameter></function></tool_call>
- Add model_name attribute to ToolCallParser for model-specific parsing
- Update ModelManager.load_model to set model name on tool parser
- Fix duplicate method definitions in ToolCallParser class
-
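The two Qwen tool call formats named in the Qwen parsing commit can be illustrated with a small regex-based sketch. The function name `parse_qwen_tool_calls` and the output shape (`name` plus `arguments`) are assumptions for illustration, not the actual ToolCallParser code; the Instruct-style JSON is assumed to carry `name` and `arguments` keys.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)


def parse_qwen_tool_calls(text: str) -> list:
    """Extract tool calls from Instruct-style and Coder-style blocks."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        body = block.strip()
        func = FUNCTION_RE.search(body)
        if func:
            # Coder-style: <function=name><parameter=k>v</parameter></function>
            args = {k.strip(): v.strip() for k, v in PARAM_RE.findall(func.group(2))}
            calls.append({"name": func.group(1).strip(), "arguments": args})
            continue
        try:
            # Instruct-style: the block body is a JSON object
            payload = json.loads(body)
        except json.JSONDecodeError:
            continue  # malformed block: ignore it rather than fail the request
        if isinstance(payload, dict) and "name" in payload:
            calls.append({"name": payload["name"], "arguments": payload.get("arguments", {})})
    return calls
```

In this sketch Coder-style parameter values stay strings; converting them to the types declared in the tool schema would be a separate step.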
Your Name authored
- Pass response_format to llama.cpp create_chat_completion
- Supports {'type': 'json_object'} for JSON output mode
- Applied to both streaming and non-streaming responses
-
Your Name authored
When chat_template is 'default' (embedded GGUF template), allow llama.cpp to handle tool messages via create_chat_completion instead of forcing manual formatting. This allows the GGUF's native template to be used even with tools.
-
Your Name authored
When chat_template is 'default', it means llama.cpp detected an embedded template in the GGUF model. Don't fall back to manual formatting - instead let llama.cpp's create_chat_completion use its internal template handling.
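The template-handling rules from these commits can be summarized as a small decision function. This is a sketch under assumptions: the function name and the returned labels are hypothetical, and the behavior when no template is set at all is a guess rather than something the log states.

```python
def choose_formatting_path(chat_template, hf_tokenizer):
    """Pick how chat messages get formatted for a GGUF model."""
    if hf_tokenizer is not None:
        # A loaded HF tokenizer is checked first.
        return "hf_tokenizer"
    if chat_template == "default":
        # Embedded GGUF template: let llama.cpp's create_chat_completion handle it.
        return "llama_cpp"
    if chat_template:
        # A known template name is set but no tokenizer is available.
        return "manual"
    # Assumption: with no template information, defer to llama.cpp.
    return "llama_cpp"
```

For example, `choose_formatting_path("qwen3", None)` selects manual formatting, while `choose_formatting_path("default", None)` leaves formatting to llama.cpp.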
-
Your Name authored
- Add provider object with provider_name and provider_id
- Add system_fingerprint (null)
- Add logprobs in choices (null)
- Add native_finish_reason in choices
- Add usage.prompt_tokens_details with cached_tokens and audio_tokens
- Add usage.completion_tokens_details with reasoning_tokens and audio_tokens
- Apply to both streaming and non-streaming responses
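As an illustration of the response fields listed above, the sketch below extends a chat completion dict. The function name is hypothetical, the zero token counts are placeholders, and mirroring `finish_reason` into `native_finish_reason` is an assumption; the commit does not say how those values are computed.

```python
def extend_chat_completion(response: dict, provider_name: str, provider_id: str) -> dict:
    """Return a copy of a chat completion with the extra fields filled in."""
    out = dict(response)
    out["provider"] = {"provider_name": provider_name, "provider_id": provider_id}
    out["system_fingerprint"] = None

    choices = []
    for choice in response.get("choices", []):
        choice = dict(choice)
        choice["logprobs"] = None
        # Assumption: the native reason mirrors the normalized finish_reason.
        choice["native_finish_reason"] = choice.get("finish_reason")
        choices.append(choice)
    out["choices"] = choices

    usage = dict(response.get("usage", {}))
    # Assumption: nothing is cached and no audio or reasoning tokens are tracked.
    usage["prompt_tokens_details"] = {"cached_tokens": 0, "audio_tokens": 0}
    usage["completion_tokens_details"] = {"reasoning_tokens": 0, "audio_tokens": 0}
    out["usage"] = usage
    return out
```

The input dict is left unmodified, so the same helper could be applied to each streaming chunk as well as to a complete response.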
-
Your Name authored
- Directly set chat_template to known template names (qwen3, qwen, llama3, etc.) instead of trying to load non-existent HuggingFace tokenizers
- Add use_manual condition to use manual formatting when chat_template is set but hf_tokenizer is None (applies to both generate_chat and generate_chat_stream)
- This ensures GGUF models loaded from URLs with known templates use proper <|im_start|> formatting instead of failing on create_chat_completion
-
Your Name authored
When HF tokenizer loading fails, try known template names based on model name:
- Qwen models: try qwen3, qwen templates
- Llama models: try llama3, llama templates
- Phi models: try phi template
- Mistral models: try mistral template

This helps when the tokenizer can't be loaded but we know the model family.
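The family-to-template lookup described above might look like the following sketch. The table and the function name `candidate_templates` are hypothetical, and plain substring matching on the lowercased model name is an assumption.

```python
# Ordered list: the first family found in the model name wins.
FAMILY_TEMPLATES = [
    ("qwen", ["qwen3", "qwen"]),
    ("llama", ["llama3", "llama"]),
    ("phi", ["phi"]),
    ("mistral", ["mistral"]),
]


def candidate_templates(model_name: str) -> list:
    """Return template names to try, in order, for a model name."""
    lowered = model_name.lower()
    for family, templates in FAMILY_TEMPLATES:
        if family in lowered:
            return list(templates)
    return []
```

Substring matching is crude: a name containing "dolphin" also contains "phi", so a real implementation would likely want token-based matching or a stricter pattern.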
-
Your Name authored
The model_backend_types attribute was not being initialized properly due to incorrect indentation, causing 'MultiModelManager' object has no attribute 'model_backend_types' error when trying to load models on-demand.
-
Your Name authored
- Add uppercase quantization suffixes (_Q4_K_M, etc.) to handle cached GGUF filenames
- Add progressive fallback to try shorter model names when tokenizer loading fails
  - Example: Qwen3.5-27B-Uncensored-HauhauCS-Aggressive -> try Qwen3.5-27B-Uncensored -> Qwen3.5-27B -> Qwen3.5 -> Qwen
- Add warning when all tokenizer loading attempts fail (will use manual formatting instead)
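A sketch of the progressive fallback, assuming candidates are produced by stripping a quantization suffix and then dropping one hyphen-separated segment at a time. The function name and the regex are hypothetical. Unlike the example in the commit message, this version also yields the intermediate `...-HauhauCS` name.

```python
import re

# Matches suffixes such as _Q4_K_M, -q8_0 or .Q5_K_S at the end of a name.
QUANT_SUFFIX_RE = re.compile(r"[-_.]Q\d(?:_[A-Za-z0-9]+)*$", re.IGNORECASE)


def fallback_names(model_name: str) -> list:
    """List progressively shorter names to try when loading a tokenizer."""
    name = model_name
    if name.lower().endswith(".gguf"):
        name = name[: -len(".gguf")]
    name = QUANT_SUFFIX_RE.sub("", name)

    candidates = [name]
    parts = name.split("-")
    while len(parts) > 1:
        parts.pop()  # drop the last hyphen-separated segment
        candidates.append("-".join(parts))

    # Final step: strip a trailing version number, e.g. Qwen3.5 -> Qwen.
    family = re.sub(r"[\d.]+$", "", parts[0])
    if family and family != candidates[-1]:
        candidates.append(family)
    return candidates
```

A caller would try each candidate in order and emit the warning only after the last one fails.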
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
- Added _aggressive_vram_cleanup method to properly clear VRAM
- Moves model to CPU before deletion
- Deletes pipeline, vae, text_encoder, tokenizer explicitly
- Multiple rounds of gc.collect()
- Uses torch.cuda.synchronize() before clearing cache
- Increased delay to 5 seconds after cleanup
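The cleanup sequence above could be sketched roughly as follows. This is not the commit's `_aggressive_vram_cleanup`; the `holder` object with a `pipeline` attribute, the parameter names, and the number of gc rounds are assumptions. PyTorch is imported optionally so the sketch also runs where it is not installed.

```python
import gc
import time

try:
    import torch  # optional: only needed for the CUDA steps
except ImportError:
    torch = None

COMPONENT_NAMES = ("vae", "text_encoder", "tokenizer")


def aggressive_vram_cleanup(holder, gc_rounds=3, delay_seconds=5.0):
    """Release an image pipeline held on `holder.pipeline` and free VRAM."""
    pipeline = getattr(holder, "pipeline", None)
    if pipeline is not None:
        move = getattr(pipeline, "to", None)
        if callable(move):
            try:
                move("cpu")  # move weights off the GPU before dropping them
            except Exception as exc:
                print(f"could not move pipeline to CPU: {exc}")
        for name in COMPONENT_NAMES:
            if hasattr(pipeline, name):
                setattr(pipeline, name, None)  # drop component references
        holder.pipeline = None
    del pipeline

    for _ in range(gc_rounds):
        gc.collect()

    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for pending GPU work first
        torch.cuda.empty_cache()

    time.sleep(delay_seconds)  # give the driver time to release memory
```

Passing `delay_seconds=0` is convenient in tests; the 5 second default mirrors the delay mentioned in the commit.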
-
Your Name authored
-
Your Name authored
- Now can specify template directly: --hf-chat-template "model:template"
- Updated check_hf_chat_template to return tuple (should_use, template_name)
- Updated _load_huggingface_tokenizer to accept template_name parameter
- Updated README with new syntax and template examples
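A sketch of how a repeatable `--hf-chat-template` flag with `model:template` values and an `auto` value could be parsed and checked. The matching rule (case-insensitive substring on the model name) and the function signature are assumptions; only the flag name, `action=append`, the `auto` value, and the `(should_use, template_name)` return shape come from the log.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # action="append" lets the flag be given several times.
    parser.add_argument("--hf-chat-template", action="append", default=None, metavar="SPEC")
    return parser


def check_hf_chat_template(model_name: str, specs) -> tuple:
    """Return (should_use, template_name) for a model."""
    for spec in specs or []:
        if spec == "auto":
            return True, None  # apply the HF template to every model
        pattern, _, template = spec.partition(":")
        if pattern and pattern.lower() in model_name.lower():
            return True, (template or None)
    return False, None
```

Usage, with hypothetical values:

```python
args = build_parser().parse_args(
    ["--hf-chat-template", "qwen:qwen3", "--hf-chat-template", "llama"]
)
print(check_hf_chat_template("Qwen3.5-27B", args.hf_chat_template))
```

Splitting on the first colon means a model pattern that itself contains a colon would need a different separator rule.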
-
Your Name authored
- Added 'auto' as a valid value for --hf-chat-template
- When --hf-chat-template auto is used, it auto-detects and applies HF template to all models
- Updated README with new syntax
-
Your Name authored
-
- 15 Mar, 2026 11 commits
-
-
Your Name authored
- Changed --hf-chat-template from boolean to action=append
- Added check_hf_chat_template() function for model-specific checking
- Updated _finalize_chat_template_detection to use new function
- Updated README with new syntax
-
Your Name authored
- Added --hf-chat-template CLI flag to use transformers apply_chat_template
- Added _load_huggingface_tokenizer() to load HF tokenizer for GGUF models
- Added _format_messages_hf() method for HF chat template formatting
- Updated generate_chat and generate_chat_stream to use HF tokenizer when available
- Updated format_messages to check for HF tokenizer first
- Added documentation in README.md
-
Your Name authored
- Added --reply-filters CLI flag to make content filtering optional
- Supports comma-separated values: --reply-filters malformed,tool_calls
- Supports model-specific filters: --reply-filters text:malformed --reply-filters image:tool_calls
- Supports specific model names: --reply-filters text:llama-3.1:malformed
- Added check_reply_filter() and check_single_filter() helper functions
- Updated stream_chat_response and generate_chat_response to use new filtering
- Updated ToolCallParser._filter_malformed_content for conditional filtering
- Added documentation in README.md
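The three `--reply-filters` syntaxes above can be parsed with a sketch like the one below. The names `parse_reply_filters` and `reply_filter_enabled` are hypothetical stand-ins, not the commit's `check_reply_filter()` or `check_single_filter()`, whose signatures are not shown in the log. Substring matching on model names is an assumption.

```python
def parse_reply_filters(specs):
    """Turn flag values into (model_type, model_name, filter_name) rules."""
    rules = []
    for spec in specs:
        parts = spec.split(":")
        if len(parts) == 1:       # "malformed,tool_calls"
            model_type, model_name, names = None, None, parts[0]
        elif len(parts) == 2:     # "text:malformed"
            model_type, model_name, names = parts[0], None, parts[1]
        else:                     # "text:llama-3.1:malformed"
            model_type, model_name, names = parts[0], ":".join(parts[1:-1]), parts[-1]
        for name in names.split(","):
            name = name.strip()
            if name:
                rules.append((model_type, model_name, name))
    return rules


def reply_filter_enabled(rules, filter_name, model_type, model_name):
    """True if any rule enables filter_name for this model."""
    for rule_type, rule_model, rule_filter in rules:
        if rule_filter != filter_name:
            continue
        if rule_type is not None and rule_type != model_type:
            continue
        if rule_model is not None and rule_model.lower() not in model_name.lower():
            continue
        return True
    return False
```

With rules built from `text:llama-3.1:tool_calls`, the `tool_calls` filter applies to a text model named `Llama-3.1-8B` but not to other text models.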
-
Your Name authored
- Give more time for Vulkan memory to be freed after unloading image models
-
Your Name authored
- Add garbage collection and torch.cuda.empty_cache() after unloading image models
- Add a small delay to allow VRAM to be freed before loading new model
- This should help prevent OOM errors when switching between image and text models
-
Your Name authored
- Remove stripping in strip_tool_calls_from_content function
- Whitespace (spaces, newlines) is valid content and should be preserved
-
Your Name authored
- When reloading a default model that was loaded from a URL, check for cached file path and use it instead of the URL
-
Your Name authored
- Replace NaN and Inf values with valid values before saving
- Clip image values to valid range [0, 1] to prevent black images
-
Your Name authored
- When 'default' model is requested but not loaded (was unloaded for image model), the code now tries to reload the default model
- Cleanup image models first to free VRAM, then reload the text model
-
Your Name authored
- Add --image-cpu-offload CLI flag for explicit sequential CPU offload
- Enable sequential CPU offload only on 3rd OOM retry or when --image-cpu-offload is set
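One way to read the retry rule above, as a sketch: offload is used from the start when the flag is set, otherwise only on the third attempt. The function name, the `load_fn` callback, and the use of `MemoryError` as a stand-in for a CUDA out-of-memory error are all assumptions, as is treating "3rd OOM retry" as the third attempt.

```python
def load_with_oom_retries(load_fn, image_cpu_offload=False, max_attempts=3):
    """Call load_fn(cpu_offload=...) and retry on out-of-memory errors."""
    for attempt in range(1, max_attempts + 1):
        # Offload on the last attempt, or always when the flag is set.
        use_offload = image_cpu_offload or attempt == max_attempts
        try:
            return load_fn(cpu_offload=use_offload)
        except MemoryError:  # stand-in for a CUDA OOM exception
            if attempt == max_attempts:
                raise
```

Keeping offload as a last resort avoids its slowdown on machines where the model fits in VRAM after an earlier, cheaper retry.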
-
Your Name authored
- Skip URLs when listing the default model in list_models()
- This prevents download URLs from appearing in available models list
-