Commits · fbb6476e4bbd7e6cca6cf280c53b5a8f990e2f6b · nexlab / coderai

16 Mar, 2026 20 commits

Add Qwen model tool call parsing support · fbb6476e

Your Name authored Mar 16, 2026

- Add Qwen-specific tool call parsing in ToolCallParser
- Support for Instruct-style: <tool_call>{JSON}</tool_call>
- Support for Coder-style: <tool_call><function=name><parameter=k>v</parameter></function></tool_call>
- Add model_name attribute to ToolCallParser for model-specific parsing
- Update ModelManager.load_model to set model name on tool parser
- Fix duplicate method definitions in ToolCallParser class
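
The two formats listed above can be parsed along the following lines. This is a minimal sketch, not the ToolCallParser code from this commit: the function name parse_qwen_tool_calls is hypothetical, and it assumes the Instruct-style JSON carries "name" and "arguments" keys.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)
FUNCTION_RE = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
PARAMETER_RE = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)


def parse_qwen_tool_calls(text):
    """Return a list of {"name": ..., "arguments": {...}} dicts found in text."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        body = block.strip()
        function_match = FUNCTION_RE.search(body)
        if function_match:
            # Coder-style: parameters are XML-like tags; values are kept as strings.
            arguments = {
                key.strip(): value.strip()
                for key, value in PARAMETER_RE.findall(function_match.group(2))
            }
            calls.append({"name": function_match.group(1).strip(), "arguments": arguments})
            continue
        try:
            # Instruct-style: the block body is a single JSON object.
            payload = json.loads(body)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of raising
        if isinstance(payload, dict) and "name" in payload:
            # Assumption: the JSON uses "name" and "arguments" keys.
            calls.append({"name": payload["name"], "arguments": payload.get("arguments", {})})
    return calls
```

Coder-style parameter values stay as raw strings in this sketch; converting them to the types declared in a tool schema would be a separate step.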

Add response_format support for JSON output · 0ce79fb9

Your Name authored Mar 16, 2026

- Pass response_format to llama.cpp create_chat_completion
- Supports {'type': 'json_object'} for JSON output mode
- Applied to both streaming and non-streaming responses
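
A rough sketch of how the pass-through might look. The helper build_completion_kwargs is hypothetical; create_chat_completion and its messages, stream, and response_format parameters are from llama-cpp-python, and the model path in the usage comment is a placeholder.

```python
def build_completion_kwargs(messages, response_format=None, stream=False):
    """Collect keyword arguments for Llama.create_chat_completion."""
    kwargs = {"messages": messages, "stream": stream}
    if response_format is not None:
        # e.g. {"type": "json_object"} requests JSON output mode
        kwargs["response_format"] = response_format
    return kwargs


# Usage with llama-cpp-python (needs a local GGUF file, so it is not run here):
#   from llama_cpp import Llama
#   llm = Llama(model_path="model.gguf")  # placeholder path
#   result = llm.create_chat_completion(
#       **build_completion_kwargs(messages, {"type": "json_object"})
#   )
```

Building the arguments in one place means the streaming and non-streaming paths cannot drift apart on whether response_format is forwarded.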

Fix: Use GGUF embedded template with tools via llama.cpp · 05e6d145

Your Name authored Mar 16, 2026

When chat_template is 'default' (embedded GGUF template), allow llama.cpp
to handle tool messages via create_chat_completion instead of forcing
manual formatting. This allows the GGUF's native template to be used
even with tools.

Fix: Use GGUF embedded chat template via llama.cpp instead of manual formatting · 280145ee

Your Name authored Mar 16, 2026

When chat_template is 'default', it means llama.cpp detected an embedded
template in the GGUF model. Don't fall back to manual formatting - instead
let llama.cpp's create_chat_completion use its internal template handling.
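
The resulting decision can be sketched as a single predicate. The function name and signature are assumptions for illustration; the second branch reflects the use_manual condition described in commit fb8ec881 further down this page.

```python
def use_manual_formatting(chat_template, hf_tokenizer=None):
    """Return True when the prompt should be formatted by hand."""
    if chat_template == "default":
        # llama.cpp found a template embedded in the GGUF file, so
        # create_chat_completion applies it (including tool messages).
        return False
    # A named template without a usable HF tokenizer needs manual formatting.
    return bool(chat_template) and hf_tokenizer is None
```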

Add missing OpenAI API response fields for better compatibility · 90abc6a8

Your Name authored Mar 16, 2026

- Add provider object with provider_name and provider_id
- Add system_fingerprint (null)
- Add logprobs in choices (null)
- Add native_finish_reason in choices
- Add usage.prompt_tokens_details with cached_tokens and audio_tokens
- Add usage.completion_tokens_details with reasoning_tokens and audio_tokens
- Apply to both streaming and non-streaming responses
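
For orientation, a non-streaming response carrying these fields might be assembled as below. This is a sketch only: build_chat_response is a hypothetical helper, the provider values are placeholders, and the zero token-detail counts are illustrative defaults rather than values taken from the commit.

```python
import time
import uuid


def build_chat_response(model, content, prompt_tokens, completion_tokens, finish_reason="stop"):
    """Assemble an OpenAI-style chat completion dict with the extra fields."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        # Placeholder provider values; the real ones are not given in the commit.
        "provider": {"provider_name": "coderai", "provider_id": "coderai"},
        "system_fingerprint": None,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": content},
                "finish_reason": finish_reason,
                "native_finish_reason": finish_reason,
                "logprobs": None,
            }
        ],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
            "prompt_tokens_details": {"cached_tokens": 0, "audio_tokens": 0},
            "completion_tokens_details": {"reasoning_tokens": 0, "audio_tokens": 0},
        },
    }
```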

Fix known template fallback and use_manual condition for GGUF models · fb8ec881

Your Name authored Mar 16, 2026

- Directly set chat_template to known template names (qwen3, qwen, llama3, etc.)
  instead of trying to load non-existent HuggingFace tokenizers
- Add use_manual condition to use manual formatting when chat_template is set
  but hf_tokenizer is None (applies to both generate_chat and generate_chat_stream)
- This ensures GGUF models loaded from URLs with known templates use proper
  <|im_start|> formatting instead of failing on create_chat_completion
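
As an illustration of the <|im_start|> formatting mentioned above, a manual ChatML formatter could look like this. The function name is hypothetical, and the project's own formatter may handle system prompts or tool messages differently.

```python
def format_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} messages as a ChatML prompt."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)
```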

Add fallback to try known chat template names when tokenizer loading fails · 8cc1af10

Your Name authored Mar 16, 2026

When HF tokenizer loading fails, try known template names based on model name:
- Qwen models: try qwen3, qwen templates
- Llama models: try llama3, llama templates
- Phi models: try phi template
- Mistral models: try mistral template

This helps when the tokenizer can't be loaded but we know the model family.
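
The family-to-template mapping above can be sketched as a lookup on the lowercased model name. The table and function names are assumptions; only the template names come from the commit message.

```python
# Order matters only if a model name could contain more than one family string.
KNOWN_TEMPLATE_CANDIDATES = (
    ("qwen", ("qwen3", "qwen")),
    ("llama", ("llama3", "llama")),
    ("phi", ("phi",)),
    ("mistral", ("mistral",)),
)


def candidate_templates(model_name):
    """Return template names to try, most specific first, for a model name."""
    lowered = model_name.lower()
    for family, templates in KNOWN_TEMPLATE_CANDIDATES:
        if family in lowered:
            return list(templates)
    return []  # unknown family: no template guess
```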

Fix: Initialize model_backend_types in MultiModelManager.__init__ · cd877dc3

Your Name authored Mar 16, 2026

The model_backend_types attribute was not being initialized properly due to
incorrect indentation, causing 'MultiModelManager' object has no attribute
'model_backend_types' error when trying to load models on-demand.
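
The commit does not show the faulty code, so the following is only a hypothetical reconstruction of how one extra indentation level can produce this error. Both class names and the preload flag are invented for the illustration.

```python
class BrokenManager:
    def __init__(self, preload=False):
        self.models = {}
        if preload:
            self._preload()
            # Bug: indented under the `if`, so this only runs when preload=True.
            self.model_backend_types = {}

    def _preload(self):
        pass


class FixedManager:
    def __init__(self, preload=False):
        self.models = {}
        # Fix: always initialized, regardless of the branch taken below.
        self.model_backend_types = {}
        if preload:
            self._preload()

    def _preload(self):
        pass
```

Accessing BrokenManager().model_backend_types raises AttributeError, while FixedManager() always has the attribute.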

Add fallback for HuggingFace tokenizer loading using progressively shorter model name variants · 3479b3f0

Your Name authored Mar 16, 2026

- Add uppercase quantization suffixes (_Q4_K_M, etc.) to handle cached GGUF filenames
- Add progressive fallback to try shorter model names when tokenizer loading fails
- Example: Qwen3.5-27B-Uncensored-HauhauCS-Aggressive -> try Qwen3.5-27B-Uncensored -> Qwen3.5-27B -> Qwen3.5 -> Qwen
- Add warning when all tokenizer loading attempts fail (will use manual formatting instead)
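
A sketch of the progressive fallback, under stated assumptions: all function names are hypothetical, the quantization-suffix pattern is a guess at what such a filter might cover, and this version drops one hyphen-separated segment at a time, so it tries one more intermediate name than the example above lists.

```python
import re
import warnings

# Assumed suffix pattern: separator plus Q4_K_M-style tags, in either case.
QUANT_SUFFIX_RE = re.compile(r"[-_.](?:I?Q\d+(?:_[A-Z0-9]+)*|F16|BF16|F32)$", re.IGNORECASE)


def strip_quant_suffix(name):
    """Drop a trailing .gguf extension and quantization tag, if present."""
    name = re.sub(r"\.gguf$", "", name, flags=re.IGNORECASE)
    return QUANT_SUFFIX_RE.sub("", name)


def tokenizer_name_candidates(model_name):
    """List model names to try, from the full name down to the bare family."""
    parts = strip_quant_suffix(model_name).split("-")
    candidates = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    family = re.match(r"[A-Za-z]+", parts[0])
    if family and family.group(0) != parts[0]:
        candidates.append(family.group(0))  # e.g. "Qwen3.5" -> "Qwen"
    return candidates


def load_first_tokenizer(candidates, loader):
    """Return the first tokenizer that loads, or None after warning."""
    for name in candidates:
        try:
            return loader(name)
        except Exception:
            continue  # try the next, shorter name
    warnings.warn("No tokenizer could be loaded; falling back to manual formatting")
    return None
```

The loader argument stands in for whatever call actually fetches the tokenizer, which keeps the fallback logic testable without network access.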

Fix: Hash prefix is 64 chars (SHA-256), add fallback for model_backend_types · 43cb91d5
Your Name authored Mar 16, 2026

Fix: Remove hash prefix from cached GGUF filenames properly · 2da217a1
Your Name authored Mar 16, 2026

Fix: Remove hash prefix from cached GGUF filenames when extracting model name · 34648b9b
Your Name authored Mar 16, 2026
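
The three hash-prefix fixes above concern recovering a model name from a cached GGUF filename. A minimal sketch, assuming the cache prepends a 64-character lowercase hex SHA-256 digest followed by an underscore or hyphen (the separator and the function name are assumptions):

```python
import re
from pathlib import Path

# 64 hex characters is the length of a SHA-256 digest.
HASH_PREFIX_RE = re.compile(r"^[0-9a-f]{64}[_-]")


def model_name_from_cached_file(path):
    """Strip the hash prefix and .gguf extension from a cached filename."""
    name = HASH_PREFIX_RE.sub("", Path(path).name)
    if name.lower().endswith(".gguf"):
        name = name[: -len(".gguf")]
    return name
```

Anchoring the pattern at exactly 64 characters avoids eating the start of a model name that merely begins with a few hex-looking characters.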

Fix HF tokenizer loading to check for cached local file first when model is URL · 13e81d0d
Your Name authored Mar 16, 2026

Fix: Add _aggressive_vram_cleanup to MultiModelManager class · b4d3d43b
Your Name authored Mar 16, 2026

Reduce VRAM cleanup delay to 2 seconds · 6f42fbde
Your Name authored Mar 16, 2026

Add aggressive VRAM cleanup for model switching · 7c150a4d

Your Name authored Mar 16, 2026

- Add _aggressive_vram_cleanup method to properly clear VRAM
- Move the model to CPU before deletion
- Delete pipeline, vae, text_encoder, and tokenizer explicitly
- Run multiple rounds of gc.collect()
- Call torch.cuda.synchronize() before clearing the cache
- Increase the delay to 5 seconds after cleanup
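
The steps listed above can be sketched as a standalone function. This is an approximation, not the method from the commit: the holder object and its attribute names are assumptions, references are dropped by setting them to None, and PyTorch is imported optionally so the sketch also runs on machines without it.

```python
import gc
import time

try:
    import torch
except ImportError:  # keeps the sketch usable without PyTorch installed
    torch = None

COMPONENT_NAMES = ("model", "pipeline", "vae", "text_encoder", "tokenizer")


def aggressive_vram_cleanup(holder, delay_seconds=5.0, gc_rounds=3):
    """Release GPU memory held by the components attached to holder."""
    model = getattr(holder, "model", None)
    if model is not None and hasattr(model, "to"):
        model.to("cpu")  # move weights off the GPU before dropping them
    del model
    for name in COMPONENT_NAMES:
        if hasattr(holder, name):
            setattr(holder, name, None)  # drop the reference explicitly
    for _ in range(gc_rounds):
        gc.collect()  # repeated passes help break reference cycles
    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for pending kernels to finish
        torch.cuda.empty_cache()  # return cached blocks to the driver
    time.sleep(delay_seconds)  # give the driver time to reclaim memory
```

The delay is a parameter here because later commits on this page reduce it from 5 seconds to 2.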

Improve --hf-chat-template help text · 804dac03
Your Name authored Mar 16, 2026

Add support for specifying chat template in --hf-chat-template · 6e794ae6

Your Name authored Mar 16, 2026

- Allow specifying the template directly: --hf-chat-template "model:template"
- Update check_hf_chat_template to return a tuple (should_use, template_name)
- Update _load_huggingface_tokenizer to accept a template_name parameter
- Update README with new syntax and template examples
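
A sketch of how the "model:template" syntax and the tuple return value might fit together. The matching rule (exact model name, no colon inside the name) and the parser setup are assumptions; the repeatable flag and the 'auto' value come from neighbouring commits on this page.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Repeatable: each occurrence appends one MODEL[:TEMPLATE] spec.
    parser.add_argument(
        "--hf-chat-template",
        action="append",
        default=[],
        metavar="MODEL[:TEMPLATE]",
    )
    return parser


def check_hf_chat_template(model_name, specs):
    """Return (should_use, template_name) for model_name given the CLI specs."""
    for spec in specs:
        if spec == "auto":
            return True, None  # apply the auto-detected HF template to every model
        model, _, template = spec.partition(":")
        if model == model_name:
            return True, template or None
    return False, None
```

For example, `--hf-chat-template modelA:qwen3 --hf-chat-template modelB` would select the qwen3 template for modelA and leave the template for modelB to be detected.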

Add auto-detect support for --hf-chat-template · e17bc553

Your Name authored Mar 16, 2026

- Add 'auto' as a valid value for --hf-chat-template
- With --hf-chat-template auto, auto-detect and apply the HF template to all models
- Update README with new syntax

Add debug output showing raw vs escaped content · 079bd8dc
Your Name authored Mar 16, 2026

15 Mar, 2026 20 commits

Make --hf-chat-template repeatable per model · 31b6480e