Commits · 3fd46920339a5d6085568fa179f6217b03889558 · nexlab / coderai

16 Mar, 2026 29 commits

Add Qwen format stripping in strip_tool_calls_from_content · 3fd46920
Your Name authored Mar 16, 2026

3fd46920
Force manual formatting when tools are present to avoid Jinja errors · 544896de
Your Name authored Mar 16, 2026

544896de
Add debug output to QwenParser · 562d5df5
Your Name authored Mar 16, 2026

562d5df5
Fix regex to handle </tool_call> closing tag · 861f8741
Your Name authored Mar 16, 2026

861f8741
Fix QwenParser to handle <tool=func_name> format · 36da321b
Your Name authored Mar 16, 2026

36da321b
Fix _to_oa to return OpenAI format with 'function' key · 886b4c2d
Your Name authored Mar 16, 2026

886b4c2d
Add debug output to model_parser showing model_name and selected parser · 55ebbc3b
Your Name authored Mar 16, 2026

55ebbc3b

Fix tool parsing error with improved error handling · 4ee69261

Your Name authored Mar 16, 2026

- Handle both dict and pydantic model formats for tools
- Add try/except around tool conversion and extraction
- More robust error handling to prevent 500 errors

4ee69261
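
Illustrative sketch (not the code from this commit) of accepting tools as either dicts or pydantic models. The helper names `tool_to_dict` and `convert_tools` and the stand-in class `_FakeTool` are hypothetical; `model_dump()` and `dict()` are the pydantic v2 and v1 export methods.

```python
from typing import Any, Dict, List, Optional


def tool_to_dict(tool: Any) -> Dict[str, Any]:
    """Normalize one tool definition to a plain dict."""
    if isinstance(tool, dict):
        return tool
    # pydantic v2 models expose model_dump(); v1 models expose dict().
    for method_name in ("model_dump", "dict"):
        method = getattr(tool, method_name, None)
        if callable(method):
            return method()
    raise TypeError(f"Unsupported tool type: {type(tool).__name__}")


def convert_tools(tools: Optional[List[Any]]) -> List[Dict[str, Any]]:
    """Convert every tool, skipping entries that cannot be converted."""
    converted = []
    for tool in tools or []:
        try:
            converted.append(tool_to_dict(tool))
        except Exception as exc:
            # One bad tool should be logged and skipped, not turned into a 500.
            print(f"Skipping unparseable tool: {exc}")
    return converted


class _FakeTool:
    """Hypothetical stand-in for a pydantic model, used only for the demo."""

    def model_dump(self) -> Dict[str, Any]:
        return {"type": "function", "function": {"name": "lookup"}}


tools = convert_tools([{"type": "function"}, _FakeTool(), 42])
```

In this sketch the integer `42` is skipped with a printed warning while the other two entries are kept.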

Integrate model_parser module as codai package · 82ee7353

Your Name authored Mar 16, 2026

- Move model_parser.py into codai/ directory
- Add __init__.py to make it a proper Python module
- Create ModelParserAdapter class to wrap ModelParserDispatcher
- Replace ToolCallParser() with ModelParserAdapter() in 4 locations
- Update import to use 'from codai import ModelParserDispatcher'

This enables model-specific tool call parsing for Qwen, DeepSeek,
Llama, Mistral, Claude, Command R, Gemma, Grok, and Phi models.

82ee7353

Add Qwen model tool call parsing support · fbb6476e

Your Name authored Mar 16, 2026

- Add Qwen-specific tool call parsing in ToolCallParser
- Support for Instruct-style: <tool_call>{JSON}</tool_call>
- Support for Coder-style: <tool_call><function=name><parameter=k>v</parameter></function></tool_call>
- Add model_name attribute to ToolCallParser for model-specific parsing
- Update ModelManager.load_model to set model name on tool parser
- Fix duplicate method definitions in ToolCallParser class

fbb6476e
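
Illustrative sketch of parsing the two Qwen formats listed above. It is not the parser from this commit: the function name `parse_qwen_tool_calls` is hypothetical, and the `name`/`arguments` keys in the Instruct-style JSON are an assumption, since the commit message only says `{JSON}`.

```python
import json
import re
from typing import Any, Dict, List

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
PARAMETER_RE = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)


def parse_qwen_tool_calls(text: str) -> List[Dict[str, Any]]:
    """Extract tool calls from Instruct-style and Coder-style output."""
    calls = []
    for body in TOOL_CALL_RE.findall(text):
        func = FUNCTION_RE.search(body)
        if func:
            # Coder style: <function=name><parameter=k>v</parameter></function>
            name = func.group(1).strip()
            args = {
                key.strip(): value.strip()
                for key, value in PARAMETER_RE.findall(func.group(2))
            }
        else:
            # Instruct style: a JSON object between the tags.
            try:
                payload = json.loads(body)
            except json.JSONDecodeError:
                continue  # malformed block, skip it
            if not isinstance(payload, dict):
                continue
            name = payload.get("name", "")        # assumed key
            args = payload.get("arguments", {})   # assumed key
        calls.append({"name": name, "arguments": args})
    return calls


instruct = '<tool_call>{"name": "get_weather", "arguments": {"city": "Rome"}}</tool_call>'
coder = (
    "<tool_call><function=get_weather>"
    "<parameter=city>Rome</parameter></function></tool_call>"
)
```

Both sample strings parse to the same call in this sketch. Coder-style parameter values stay strings because that format carries no type information.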

Add response_format support for JSON output · 0ce79fb9

Your Name authored Mar 16, 2026

- Pass response_format to llama.cpp create_chat_completion
- Supports {'type': 'json_object'} for JSON output mode
- Applied to both streaming and non-streaming responses

0ce79fb9

Fix: Use GGUF embedded template with tools via llama.cpp · 05e6d145

Your Name authored Mar 16, 2026

When chat_template is 'default' (embedded GGUF template), allow llama.cpp
to handle tool messages via create_chat_completion instead of forcing
manual formatting. This allows the GGUF's native template to be used
even with tools.

05e6d145

Fix: Use GGUF embedded chat template via llama.cpp instead of manual formatting · 280145ee

Your Name authored Mar 16, 2026

When chat_template is 'default', it means llama.cpp detected an embedded
template in the GGUF model. Don't fall back to manual formatting - instead
let llama.cpp's create_chat_completion use its internal template handling.

280145ee

Add missing OpenAI API response fields for better compatibility · 90abc6a8

Your Name authored Mar 16, 2026

- Add provider object with provider_name and provider_id
- Add system_fingerprint (null)
- Add logprobs in choices (null)
- Add native_finish_reason in choices
- Add usage.prompt_tokens_details with cached_tokens and audio_tokens
- Add usage.completion_tokens_details with reasoning_tokens and audio_tokens
- Apply to both streaming and non-streaming responses

90abc6a8
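
Illustrative sketch of a non-streaming response carrying the fields listed above. The function name `build_chat_response`, the `"coderai"` provider values, and the zero token-detail counts are placeholders, not values taken from the commit.

```python
import time
import uuid
from typing import Any, Dict


def build_chat_response(
    model: str,
    content: str,
    prompt_tokens: int,
    completion_tokens: int,
    finish_reason: str = "stop",
) -> Dict[str, Any]:
    """Assemble a chat.completion payload with the extra compatibility fields."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        # Placeholder provider values; the real ones are not in the commit message.
        "provider": {"provider_name": "coderai", "provider_id": "coderai"},
        "system_fingerprint": None,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": content},
                "logprobs": None,
                "finish_reason": finish_reason,
                "native_finish_reason": finish_reason,
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
            "prompt_tokens_details": {"cached_tokens": 0, "audio_tokens": 0},
            "completion_tokens_details": {"reasoning_tokens": 0, "audio_tokens": 0},
        },
    }


response = build_chat_response("example-model", "Hello", 12, 3)
```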

Fix known template fallback and use_manual condition for GGUF models · fb8ec881

Your Name authored Mar 16, 2026

- Directly set chat_template to known template names (qwen3, qwen, llama3, etc.)
  instead of trying to load non-existent HuggingFace tokenizers
- Add use_manual condition to use manual formatting when chat_template is set
  but hf_tokenizer is None (applies to both generate_chat and generate_chat_stream)
- This ensures GGUF models loaded from URLs with known templates use proper
  <|im_start|> formatting instead of failing on create_chat_completion

fb8ec881

Add fallback to try known chat template names when tokenizer loading fails · 8cc1af10

Your Name authored Mar 16, 2026

When HF tokenizer loading fails, try known template names based on model name:
- Qwen models: try qwen3, qwen templates
- Llama models: try llama3, llama templates
- Phi models: try phi template
- Mistral models: try mistral template

This helps when the tokenizer can't be loaded but we know the model family.

8cc1af10
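
Illustrative sketch of the family-to-template lookup described above. The table mirrors the list in the commit message; the function name `candidate_templates` and the substring matching are assumptions.

```python
from typing import List, Tuple

# Family substring -> template names to try, in order.
KNOWN_TEMPLATES: List[Tuple[str, List[str]]] = [
    ("qwen", ["qwen3", "qwen"]),
    ("llama", ["llama3", "llama"]),
    ("phi", ["phi"]),
    ("mistral", ["mistral"]),
]


def candidate_templates(model_name: str) -> List[str]:
    """Return the template names to try for a model, or [] if unknown."""
    lowered = model_name.lower()
    for family, templates in KNOWN_TEMPLATES:
        # Plain substring matching is crude: "dolphin" contains "phi",
        # so a real implementation may want stricter matching.
        if family in lowered:
            return list(templates)
    return []
```

For example, `candidate_templates("Qwen3.5-27B-Q4_K_M.gguf")` yields `["qwen3", "qwen"]` in this sketch, and an unrecognized name yields an empty list.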

Fix: Initialize model_backend_types in MultiModelManager.__init__ · cd877dc3

Your Name authored Mar 16, 2026

The model_backend_types attribute was not being initialized properly due to
incorrect indentation, causing 'MultiModelManager' object has no attribute
'model_backend_types' error when trying to load models on-demand.

cd877dc3

Add fallback for HuggingFace tokenizer loading with progressively shorter model name variants · 3479b3f0

Your Name authored Mar 16, 2026

- Add uppercase quantization suffixes (_Q4_K_M, etc.) to handle cached GGUF filenames
- Add progressive fallback to try shorter model names when tokenizer loading fails
- Example: Qwen3.5-27B-Uncensored-HauhauCS-Aggressive -> try Qwen3.5-27B-Uncensored -> Qwen3.5-27B -> Qwen3.5 -> Qwen
- Add warning when all tokenizer loading attempts fail (will use manual formatting instead)

3479b3f0
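
Illustrative sketch of generating progressively shorter names. It is an assumption about how such a fallback could work, not the committed code: it drops one `-` segment at a time (so it also tries `Qwen3.5-27B-Uncensored-HauhauCS`, which the example above skips), and the names `QUANT_SUFFIX_RE` and `tokenizer_name_candidates` are hypothetical.

```python
import re
from typing import List

# Matches a trailing quantization suffix such as _Q4_K_M, -q8_0 or .Q5.
QUANT_SUFFIX_RE = re.compile(r"[-_.](Q\d+_[A-Z0-9_]+|Q\d+)$", re.IGNORECASE)


def tokenizer_name_candidates(model_name: str) -> List[str]:
    """List names to try, from the most specific to the bare family name."""
    name = model_name
    if name.lower().endswith(".gguf"):
        name = name[: -len(".gguf")]
    name = QUANT_SUFFIX_RE.sub("", name)

    parts = name.split("-")
    candidates = ["-".join(parts[:n]) for n in range(len(parts), 0, -1)]

    # Last resort: strip a trailing version number ("Qwen3.5" -> "Qwen").
    family = re.sub(r"[\d.]+$", "", parts[0])
    if family and family != candidates[-1]:
        candidates.append(family)
    return candidates


names = tokenizer_name_candidates(
    "Qwen3.5-27B-Uncensored-HauhauCS-Aggressive_Q4_K_M.gguf"
)
```

A caller would try each candidate in order and stop at the first tokenizer that loads, warning only when every candidate fails.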

Fix: Hash prefix is 64 chars (SHA-256), add fallback for model_backend_types · 43cb91d5
Your Name authored Mar 16, 2026

43cb91d5
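
Illustrative sketch of stripping a 64-character SHA-256 hex prefix from a cached filename. The optional `_` or `-` separator after the hash and the name `strip_hash_prefix` are assumptions; the commit only states the prefix length.

```python
import os
import re

# 64 hex characters (a SHA-256 digest), optionally followed by "_" or "-".
HASH_PREFIX_RE = re.compile(r"^[0-9a-f]{64}[_-]?", re.IGNORECASE)


def strip_hash_prefix(path: str) -> str:
    """Return the filename without a leading SHA-256 hex prefix, if present."""
    filename = os.path.basename(path)
    return HASH_PREFIX_RE.sub("", filename)


digest = "ab" * 32  # 64 hex characters, standing in for a real hash
cleaned = strip_hash_prefix(f"/cache/{digest}_Qwen3.5-27B-Q4_K_M.gguf")
```

Names without a full 64-character hex prefix are returned unchanged in this sketch.
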
Fix: Remove hash prefix from cached GGUF filenames properly · 2da217a1
Your Name authored Mar 16, 2026

2da217a1
Fix: Remove hash prefix from cached GGUF filenames when extracting model name · 34648b9b
Your Name authored Mar 16, 2026

34648b9b
Fix HF tokenizer loading to check for cached local file first when model is URL · 13e81d0d
Your Name authored Mar 16, 2026

13e81d0d
Fix: Add _aggressive_vram_cleanup to MultiModelManager class · b4d3d43b
Your Name authored Mar 16, 2026

b4d3d43b
Reduce VRAM cleanup delay to 2 seconds · 6f42fbde
Your Name authored Mar 16, 2026

6f42fbde

Add aggressive VRAM cleanup for model switching · 7c150a4d

Your Name authored Mar 16, 2026

- Added _aggressive_vram_cleanup method to properly clear VRAM
- Moves model to CPU before deletion
- Deletes pipeline, vae, text_encoder, tokenizer explicitly
- Multiple rounds of gc.collect()
- Uses torch.cuda.synchronize() before clearing cache
- Increased delay to 5 seconds after cleanup

7c150a4d
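
Illustrative sketch of the cleanup sequence listed above, assuming the components live as attributes on some holder object. The function name, its parameters, and the guarded `torch` import are choices made for this sketch; `gc.collect()`, `torch.cuda.synchronize()` and `torch.cuda.empty_cache()` are the real calls.

```python
import gc
import time

try:
    import torch
except ImportError:  # keep the sketch importable without PyTorch
    torch = None


def aggressive_vram_cleanup(
    holder,
    attrs=("pipeline", "vae", "text_encoder", "tokenizer"),
    gc_rounds=3,
    delay_seconds=5.0,
):
    """Drop model references on `holder` and release cached GPU memory."""
    pipeline = getattr(holder, "pipeline", None)
    if pipeline is not None and hasattr(pipeline, "to"):
        try:
            pipeline.to("cpu")  # move weights off the GPU before deletion
        except Exception:
            pass  # best effort; continue with the cleanup anyway
    del pipeline  # drop the local reference as well

    for attr in attrs:
        if hasattr(holder, attr):
            setattr(holder, attr, None)

    for _ in range(gc_rounds):
        gc.collect()

    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()   # wait for pending GPU work
        torch.cuda.empty_cache()   # return cached blocks to the driver

    time.sleep(delay_seconds)  # give the driver time to free memory
```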

Improve --hf-chat-template help text · 804dac03
Your Name authored Mar 16, 2026

804dac03

Add support for specifying chat template in --hf-chat-template · 6e794ae6

Your Name authored Mar 16, 2026

- A template can now be specified directly: --hf-chat-template "model:template"
- Updated check_hf_chat_template to return tuple (should_use, template_name)
- Updated _load_huggingface_tokenizer to accept template_name parameter
- Updated README with new syntax and template examples

6e794ae6
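
Illustrative sketch of resolving a `model:template` spec into a `(should_use, template_name)` tuple. It is a guess at the behavior, not the real `check_hf_chat_template`: the name `check_hf_chat_template_sketch`, exact-match comparison, and the assumption that model names contain no colon are all hypothetical.

```python
from typing import List, Optional, Tuple


def check_hf_chat_template_sketch(
    model_name: str, specs: Optional[List[str]]
) -> Tuple[bool, Optional[str]]:
    """Return (should_use, template_name) for a model.

    Each spec is "auto", "model", or "model:template".
    """
    for spec in specs or []:
        if spec == "auto":
            return True, None  # applies to every model, template auto-detected
        model, sep, template = spec.partition(":")
        if model == model_name:
            return True, (template if sep and template else None)
    return False, None
```

With the hypothetical spec `"qwen3-8b:qwen3"`, the model `qwen3-8b` resolves to `(True, "qwen3")` and any other model to `(False, None)`.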

Add auto-detect support for --hf-chat-template · e17bc553

Your Name authored Mar 16, 2026

- Added 'auto' as a valid value for --hf-chat-template
- When --hf-chat-template auto is used, the HF template is auto-detected and applied to all models
- Updated README with new syntax

e17bc553

Add debug output showing raw vs escaped content · 079bd8dc
Your Name authored Mar 16, 2026

079bd8dc

15 Mar, 2026 11 commits

Make --hf-chat-template repeatable per model · 31b6480e

Your Name authored Mar 15, 2026

- Changed --hf-chat-template from boolean to action=append
- Added check_hf_chat_template() function for model-specific checking
- Updated _finalize_chat_template_detection to use new function
- Updated README with new syntax

31b6480e
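
Illustrative sketch of a repeatable flag using argparse's `action="append"`. The program name, help text, and model names are placeholders; only the flag name and the `append` action come from the commit.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="coderai")  # placeholder prog name
    parser.add_argument(
        "--hf-chat-template",
        action="append",   # each occurrence adds one entry to a list
        default=None,      # stays None when the flag is never given
        metavar="MODEL",
        help="apply the HuggingFace chat template to MODEL (repeatable)",
    )
    return parser


args = build_parser().parse_args(
    ["--hf-chat-template", "qwen3-8b", "--hf-chat-template", "llama-3.1"]
)
```

After parsing, `args.hf_chat_template` holds one list entry per occurrence of the flag, in command-line order.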

Add --hf-chat-template option for HuggingFace apply_chat_template · 3121fb85

Your Name authored Mar 15, 2026

- Added --hf-chat-template CLI flag to use transformers apply_chat_template
- Added _load_huggingface_tokenizer() to load HF tokenizer for GGUF models
- Added _format_messages_hf() method for HF chat template formatting
- Updated generate_chat and generate_chat_stream to use HF tokenizer when available
- Updated format_messages to check for HF tokenizer first
- Added documentation in README.md

3121fb85

Add --reply-filters option for optional content filtering · 533f8fd5

Your Name authored Mar 15, 2026

- Added --reply-filters CLI flag to make content filtering optional
- Supports comma-separated values: --reply-filters malformed,tool_calls
- Supports model-specific filters: --reply-filters text:malformed --reply-filters image:tool_calls
- Supports specific model names: --reply-filters text:llama-3.1:malformed
- Added check_reply_filter() and check_single_filter() helper functions
- Updated stream_chat_response and generate_chat_response to use new filtering
- Updated ToolCallParser._filter_malformed_content for conditional filtering
- Added documentation in README.md

533f8fd5
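
Illustrative sketch of parsing the three `--reply-filters` forms shown above. `FilterRule`, `parse_reply_filter` and `filter_enabled` are hypothetical names (the commit's helpers are `check_reply_filter()` and `check_single_filter()`), and exact-match comparison of model names is an assumption.

```python
from typing import FrozenSet, Iterable, NamedTuple, Optional


class FilterRule(NamedTuple):
    model_type: Optional[str]   # e.g. "text" or "image"; None means any
    model_name: Optional[str]   # e.g. "llama-3.1"; None means any
    filters: FrozenSet[str]     # e.g. {"malformed", "tool_calls"}


def parse_reply_filter(spec: str) -> FilterRule:
    """Parse "f1,f2", "type:f1,f2" or "type:model:f1,f2"."""
    parts = spec.split(":")
    if len(parts) == 1:
        model_type, model_name = None, None
    elif len(parts) == 2:
        model_type, model_name = parts[0], None
    elif len(parts) == 3:
        model_type, model_name = parts[0], parts[1]
    else:
        raise ValueError(f"Invalid reply filter spec: {spec!r}")
    names = frozenset(n.strip() for n in parts[-1].split(",") if n.strip())
    return FilterRule(model_type, model_name, names)


def filter_enabled(
    rules: Iterable[FilterRule], filter_name: str, model_type: str, model_name: str
) -> bool:
    """True if any rule enables `filter_name` for this model."""
    for rule in rules:
        if rule.model_type not in (None, model_type):
            continue
        if rule.model_name not in (None, model_name):
            continue
        if filter_name in rule.filters:
            return True
    return False


rules = [
    parse_reply_filter(spec)
    for spec in ["text:malformed", "image:tool_calls", "text:llama-3.1:tool_calls"]
]
```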

Increase VRAM cleanup delay to 3 seconds · f8d2481e
Your Name authored Mar 15, 2026

- Give more time for Vulkan memory to be freed after unloading image models

f8d2481e

Fix: Force VRAM cleanup when switching from image to text model · a4a8c340

Your Name authored Mar 15, 2026

- Add garbage collection and torch.cuda.empty_cache() after unloading image models
- Add a small delay to allow VRAM to be freed before loading new model
- This should help prevent OOM errors when switching between image and text models

a4a8c340

Fix: Don't strip whitespace from model output · 13b56ea0

Your Name authored Mar 15, 2026

- Remove stripping in strip_tool_calls_from_content function
- Whitespace (spaces, newlines) is valid content and should be preserved

13b56ea0

Fix: Use cached file path when reloading URL-based models · 7e7930f5

Your Name authored Mar 15, 2026

- When reloading a default model that was loaded from a URL,
  check for cached file path and use it instead of the URL

7e7930f5

Fix: Handle NaN values in diffusers image output · dd924c1a

Your Name authored Mar 15, 2026

- Replace NaN and Inf values with valid values before saving
- Clip image values to valid range [0, 1] to prevent black images

dd924c1a

Fix: Reload default text model when switching from image to text · c517c947

Your Name authored Mar 15, 2026

- When 'default' model is requested but not loaded (was unloaded for image model),
  the code now tries to reload the default model
- Clean up image models first to free VRAM, then reload the text model

c517c947

Add --image-cpu-offload option and fix sequential offload logic · c8f7c8d9

Your Name authored Mar 15, 2026

- Add --image-cpu-offload CLI flag for explicit sequential CPU offload
- Enable sequential CPU offload only on 3rd OOM retry or when --image-cpu-offload is set

c8f7c8d9

Fix: Filter URLs from default model listing · ac005426

Your Name authored Mar 15, 2026

- Skip URLs when listing the default model in list_models()
- This prevents download URLs from appearing in available models list

ac005426