Commits · 39f8696e1e4bf164f269db790a7d425226058dbf · nexlab / coderai

16 Mar, 2026 35 commits

Implement LiteLLM integration for OpenAI-compatible /v1/chat/completions · 39f8696e

Your Name authored Mar 16, 2026

- Add litellm to requirements.txt
- Add --parser CLI arg (auto/litellm, default auto)
- Create codai/litellm_backend.py module with:
  - LiteLLMBackend class for standardized responses
  - Rate limit headers (x-ratelimit-remaining-tokens, x-ratelimit-limit-tokens)
  - Qwen tool-call resilience (parse <tool> and <tool_call> tags)
  - Error handling with litellm exception mapping
- Update chat completions endpoint to use litellm when --parser litellm
- Update codai/__init__.py to export litellm components

39f8696e
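
The two rate limit headers named in this commit could be attached to a response roughly as follows. This is a minimal sketch: `rate_limit_headers` and its token-budget arguments are hypothetical, since the commit message does not say how the values are computed.

```python
def rate_limit_headers(limit_tokens: int, used_tokens: int) -> dict[str, str]:
    """Build the two token rate limit headers (hypothetical helper)."""
    # Never report a negative remaining budget.
    remaining = max(limit_tokens - used_tokens, 0)
    return {
        "x-ratelimit-limit-tokens": str(limit_tokens),
        "x-ratelimit-remaining-tokens": str(remaining),
    }


headers = rate_limit_headers(limit_tokens=8192, used_tokens=1200)
```

Header values are strings because HTTP header values are text; the caller would merge this dict into the response headers.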

Add --parser CLI arg and litellm dependency for future integration · 7ec43f73

Your Name authored Mar 16, 2026

- Added litellm>=1.40.0 to requirements.txt
- Added --parser argument (auto/litellm, default auto)

Note: Full litellm integration requires significant refactoring of the
chat completion endpoints to use litellm.completion() for standardized
responses, adding rate limit headers, and error handling.

7ec43f73
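
A `--parser` argument with the two values and default described above might be declared like this. It is a sketch only: the program name and help text are assumptions, not taken from the repository.

```python
import argparse


def build_arg_parser() -> argparse.ArgumentParser:
    # "coderai" as the program name is an assumption for illustration.
    parser = argparse.ArgumentParser(prog="coderai")
    parser.add_argument(
        "--parser",
        choices=["auto", "litellm"],  # the two values named in the commit
        default="auto",
        help="Response parser backend (hypothetical help text).",
    )
    return parser


args = build_arg_parser().parse_args(["--parser", "litellm"])
```

Using `choices` makes argparse reject any other value with a usage error before the server starts.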

Improve QwenParser with repetition guard and add repeat_penalty to API · 9e9febbd

Your Name authored Mar 16, 2026

QwenParser:
- Add repetition guard to handle looping models
- Improve flexible tag matching for tool/tool_call/function_call
- Add JSON recovery for unclosed JSON
- Add circuit breaker after first valid call
- Support <call=name> in coder style fallback

API:
- Add repeat_penalty parameter to ChatCompletionRequest
- Add repeat_penalty parameter to CompletionRequest

9e9febbd
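
One way to recover unclosed JSON, as mentioned in the QwenParser notes above, is to track open strings and brackets and append the missing closers. The function below is an illustrative sketch, not the repository's implementation; `recover_unclosed_json` is a hypothetical name.

```python
import json


def recover_unclosed_json(text: str):
    """Parse JSON, closing any unterminated string or bracket first if needed."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    closers = []        # closing brackets still owed, innermost last
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()

    repaired = text + ('"' if in_string else "") + "".join(reversed(closers))
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        return None  # not recoverable by simply closing delimiters


call = recover_unclosed_json('{"name": "read_file", "arguments": {"path": "a.txt"')
```

This only handles output that was cut off; text that is malformed in other ways still returns `None`.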

Improve QwenParser with cleaner parsing logic and coder style fallback · 433eb3ee

Your Name authored Mar 16, 2026

- Added pre-cleaning for thinking/special tokens
- Unified tag matching for both <tool> and <tool_call>
- Added markdown code block stripping inside tags
- Added lazy JSON parsing fallback
- Added _parse_coder_style() and _relaxed_val() helper methods

433eb3ee
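
The pre-cleaning and code block stripping steps listed above might look like the sketch below. The exact tokens the parser removes are not stated in the commit, so the `<think>` block and `<|...|>` token patterns here are assumptions, and both function names are hypothetical.

```python
import re

# Assumed patterns: reasoning blocks and <|special|> tokens.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
SPECIAL_RE = re.compile(r"<\|[^|>]+\|>")
# A leading fence with optional language tag, or a trailing fence.
FENCE_RE = re.compile(r"^`{3}[A-Za-z]*[ \t]*\n?|\n?`{3}\s*$")


def pre_clean(text: str) -> str:
    """Drop thinking blocks and special tokens before tag matching."""
    text = THINK_RE.sub("", text)
    text = SPECIAL_RE.sub("", text)
    return text.strip()


def strip_code_fence(payload: str) -> str:
    """Remove a markdown code fence wrapped around a tag's payload."""
    return FENCE_RE.sub("", payload.strip()).strip()


fence = "`" * 3
cleaned = pre_clean("<think>plan the call</think><|im_start|>hello<|im_end|>")
inner = strip_code_fence(fence + 'json\n{"a": 1}\n' + fence)
```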

Update QwenParser with improved parsing and add _clean_json_string helper · 73d1c77c

Your Name authored Mar 16, 2026

- Added _clean_json_string() method to BaseParser for cleaning JSON strings
- Updated QwenParser.parse() with 3-step parsing strategy:
  1. Qwen format: <tool=func_name>...</tool>
  2. JSON format with flexible tag matching
  3. Fallback coder style with parameter tags
- Fixed syntax issues in the module

73d1c77c
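
The ordered strategy above can be sketched as a chain in which the first step that yields a call wins. This is a simplified illustration under assumptions: the body of `<tool=func_name>` is taken to be JSON arguments, the function names are hypothetical, and the third (coder style) step is left as an optional callable.

```python
import json
import re

QWEN_RE = re.compile(r"<tool=([\w.-]+)>(.*?)</tool>", re.DOTALL)
TAGGED_JSON_RE = re.compile(r"<(tool|tool_call|function_call)>(.*?)</\1>", re.DOTALL)


def parse_qwen_format(text):
    """Step 1: <tool=func_name>{...JSON arguments...}</tool> (body format assumed)."""
    calls = []
    for name, body in QWEN_RE.findall(text):
        try:
            args = json.loads(body) if body.strip() else {}
        except json.JSONDecodeError:
            continue
        calls.append({"name": name, "arguments": args})
    return calls


def parse_tagged_json(text):
    """Step 2: a JSON object inside any of the accepted tags."""
    calls = []
    for _tag, body in TAGGED_JSON_RE.findall(text):
        try:
            obj = json.loads(body)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "name" in obj:
            calls.append({"name": obj["name"], "arguments": obj.get("arguments", {})})
    return calls


def parse_tool_calls(text, fallback=None):
    """Run the steps in order and return the first non-empty result."""
    steps = [parse_qwen_format, parse_tagged_json]
    if fallback is not None:
        steps.append(fallback)  # step 3, e.g. a coder style parser
    for step in steps:
        calls = step(text)
        if calls:
            return calls
    return []


result = parse_tool_calls('<tool=get_time>{"tz": "UTC"}</tool>')
```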

Restructure: Move parser to codai.models, add templates, update imports · 0c504a0b

Your Name authored Mar 16, 2026

0c504a0b

Add Qwen format stripping in strip_tool_calls_from_content · 3fd46920

Your Name authored Mar 16, 2026

3fd46920

Force manual formatting when tools are present to avoid Jinja errors · 544896de

Your Name authored Mar 16, 2026

544896de

Add debug output to QwenParser · 562d5df5

Your Name authored Mar 16, 2026

562d5df5

Fix regex to handle </tool_call> closing tag · 861f8741

Your Name authored Mar 16, 2026

861f8741

Fix QwenParser to handle <tool=func_name> format · 36da321b

Your Name authored Mar 16, 2026

36da321b

Fix _to_oa to return OpenAI format with 'function' key · 886b4c2d

Your Name authored Mar 16, 2026

886b4c2d

Add debug output to model_parser showing model_name and selected parser · 55ebbc3b

Your Name authored Mar 16, 2026

55ebbc3b

Fix tool parsing error with improved error handling · 4ee69261

Your Name authored Mar 16, 2026

- Handle both dict and pydantic model formats for tools
- Add try/except around tool conversion and extraction
- More robust error handling to prevent 500 errors

4ee69261
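
Accepting tools as either plain dicts or pydantic models, with a guard around each conversion, could be sketched as follows. The helper names are hypothetical; `model_dump()` and `dict()` are the pydantic v2 and v1 serialization methods respectively.

```python
from typing import Any, Iterable, Optional


def tool_to_dict(tool: Any) -> dict:
    """Normalize one tool definition to a plain dict."""
    if isinstance(tool, dict):
        return tool
    if hasattr(tool, "model_dump"):  # pydantic v2 models
        return tool.model_dump()
    if hasattr(tool, "dict"):        # pydantic v1 models
        return tool.dict()
    raise TypeError(f"Unsupported tool type: {type(tool).__name__}")


def safe_convert_tools(tools: Optional[Iterable[Any]]) -> list:
    """Convert every tool, skipping bad entries instead of failing the request."""
    converted = []
    for tool in tools or []:
        try:
            converted.append(tool_to_dict(tool))
        except Exception as exc:
            print(f"Skipping tool that could not be converted: {exc}")
    return converted
```

Skipping a bad entry keeps a single malformed tool from turning the whole request into a 500 response, which is the goal the commit states.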

Integrate model_parser module as codai package · 82ee7353

Your Name authored Mar 16, 2026

- Move model_parser.py into codai/ directory
- Add __init__.py to make it a proper Python module
- Create ModelParserAdapter class to wrap ModelParserDispatcher
- Replace ToolCallParser() with ModelParserAdapter() in 4 locations
- Update import to use 'from codai import ModelParserDispatcher'

This enables model-specific tool call parsing for Qwen, DeepSeek,
Llama, Mistral, Claude, Command R, Gemma, Grok, and Phi models.

82ee7353

Add Qwen model tool call parsing support · fbb6476e

Your Name authored Mar 16, 2026

- Add Qwen-specific tool call parsing in ToolCallParser
- Support for Instruct-style: <tool_call>{JSON}</tool_call>
- Support for Coder-style: <tool_call><function=name><parameter=k>v</parameter></function></tool_call>
- Add model_name attribute to ToolCallParser for model-specific parsing
- Update ModelManager.load_model to set model name on tool parser
- Fix duplicate method definitions in ToolCallParser class

fbb6476e
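
The two Qwen formats described above can be matched with a pair of regular expressions. The sketch below is illustrative rather than the repository's code: `parse_qwen_tool_calls` is a hypothetical name, and coercing parameter values through `json.loads` is an assumption.

```python
import json
import re

# Instruct-style: <tool_call>{JSON}</tool_call>
INSTRUCT_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
# Coder-style: <tool_call><function=name><parameter=k>v</parameter></function></tool_call>
CODER_RE = re.compile(
    r"<tool_call>\s*<function=([^>]+)>(.*?)</function>\s*</tool_call>", re.DOTALL
)
PARAM_RE = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)


def _coerce(value: str):
    """Interpret a parameter as JSON when possible, else keep it as a string."""
    value = value.strip()
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        return value


def parse_qwen_tool_calls(text: str) -> list:
    """Collect Instruct-style calls first, then Coder-style calls."""
    calls = []
    for raw in INSTRUCT_RE.findall(text):
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "name" in obj:
            calls.append({"name": obj["name"], "arguments": obj.get("arguments", {})})
    for name, body in CODER_RE.findall(text):
        args = {key.strip(): _coerce(val) for key, val in PARAM_RE.findall(body)}
        calls.append({"name": name.strip(), "arguments": args})
    return calls


coder_text = (
    "<tool_call><function=read_file><parameter=path>src/main.py</parameter>"
    "<parameter=limit>5</parameter></function></tool_call>"
)
calls = parse_qwen_tool_calls(coder_text)
```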

Add response_format support for JSON output · 0ce79fb9

Your Name authored Mar 16, 2026

- Pass response_format to llama.cpp create_chat_completion
- Supports {'type': 'json_object'} for JSON output mode
- Applied to both streaming and non-streaming responses

0ce79fb9
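
Forwarding `response_format` only when the client supplied it might look like this. `build_chat_kwargs` is a hypothetical helper; the resulting keyword arguments are meant for llama-cpp-python's `Llama.create_chat_completion`, which accepts `messages`, `stream`, and `response_format`.

```python
from typing import Optional


def build_chat_kwargs(
    messages: list,
    response_format: Optional[dict] = None,
    stream: bool = False,
) -> dict:
    """Assemble keyword arguments for create_chat_completion (hypothetical helper)."""
    kwargs = {"messages": messages, "stream": stream}
    if response_format is not None:
        # e.g. {"type": "json_object"} to request JSON output mode
        kwargs["response_format"] = response_format
    return kwargs


kwargs = build_chat_kwargs(
    [{"role": "user", "content": "Reply with a JSON object."}],
    response_format={"type": "json_object"},
)
# With a loaded model: llm.create_chat_completion(**kwargs)
```

Building the arguments in one place lets the streaming and non-streaming paths share the same handling.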

Fix: Use GGUF embedded template with tools via llama.cpp · 05e6d145

Your Name authored Mar 16, 2026

When chat_template is 'default' (embedded GGUF template), allow llama.cpp
to handle tool messages via create_chat_completion instead of forcing
manual formatting. This allows the GGUF's native template to be used
even with tools.

05e6d145

Fix: Use GGUF embedded chat template via llama.cpp instead of manual formatting · 280145ee

Your Name authored Mar 16, 2026

When chat_template is 'default', it means llama.cpp detected an embedded
template in the GGUF model. Don't fall back to manual formatting - instead
let llama.cpp's create_chat_completion use its internal template handling.

280145ee

Add missing OpenAI API response fields for better compatibility · 90abc6a8

Your Name authored Mar 16, 2026

- Add provider object with provider_name and provider_id
- Add system_fingerprint (null)
- Add logprobs in choices (null)
- Add native_finish_reason in choices
- Add usage.prompt_tokens_details with cached_tokens and audio_tokens
- Add usage.completion_tokens_details with reasoning_tokens and audio_tokens
- Apply to both streaming and non-streaming responses

90abc6a8
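
A non-streaming response carrying the fields listed above could be assembled as in this sketch. The `provider` values, the zeroed token details, and the choice to mirror `finish_reason` into `native_finish_reason` are placeholders, not values confirmed by the commit.

```python
import time
import uuid


def build_chat_response(model, content, prompt_tokens, completion_tokens,
                        finish_reason="stop"):
    """Assemble a chat completion payload with the extra compatibility fields."""
    return {
        "id": f"chatcmpl-{uuid.uuid4().hex}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        # Placeholder provider values for illustration.
        "provider": {"provider_name": "coderai", "provider_id": "local"},
        "system_fingerprint": None,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": content},
            "logprobs": None,
            "finish_reason": finish_reason,
            "native_finish_reason": finish_reason,  # assumed to mirror finish_reason
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
            "prompt_tokens_details": {"cached_tokens": 0, "audio_tokens": 0},
            "completion_tokens_details": {"reasoning_tokens": 0, "audio_tokens": 0},
        },
    }


response = build_chat_response("qwen3", "Hello!", prompt_tokens=12, completion_tokens=3)
```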

Fix known template fallback and use_manual condition for GGUF models · fb8ec881

Your Name authored Mar 16, 2026

- Directly set chat_template to known template names (qwen3, qwen, llama3, etc.)
  instead of trying to load non-existent HuggingFace tokenizers
- Add use_manual condition to use manual formatting when chat_template is set
  but hf_tokenizer is None (applies to both generate_chat and generate_chat_stream)
- This ensures GGUF models loaded from URLs with known templates use proper
  <|im_start|> formatting instead of failing on create_chat_completion

fb8ec881

Add fallback to try known chat template names when tokenizer loading fails · 8cc1af10

Your Name authored Mar 16, 2026

When HF tokenizer loading fails, try known template names based on model name:
- Qwen models: try qwen3, qwen templates
- Llama models: try llama3, llama templates
- Phi models: try phi template
- Mistral models: try mistral template

This helps when the tokenizer can't be loaded but we know the model family.

8cc1af10
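
The family-to-template fallback listed above can be sketched as a lookup keyed on a substring of the model name. The template names come from the commit message; the function name and the case-insensitive substring match are assumptions.

```python
# Template candidates per model family, in the order they would be tried.
KNOWN_TEMPLATES = {
    "qwen": ["qwen3", "qwen"],
    "llama": ["llama3", "llama"],
    "phi": ["phi"],
    "mistral": ["mistral"],
}


def candidate_templates(model_name: str) -> list:
    """Return template names to try for a model, or [] if the family is unknown."""
    lowered = model_name.lower()
    for family, templates in KNOWN_TEMPLATES.items():
        if family in lowered:
            return list(templates)
    return []


templates = candidate_templates("Qwen3.5-27B-Q4_K_M.gguf")
```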

Fix: Initialize model_backend_types in MultiModelManager.__init__ · cd877dc3

Your Name authored Mar 16, 2026

The model_backend_types attribute was never initialized because of
incorrect indentation, causing an AttributeError ('MultiModelManager' object
has no attribute 'model_backend_types') when loading models on demand.

cd877dc3

Add fallback for HuggingFace tokenizer loading with progressively shorter model name variants · 3479b3f0

Your Name authored Mar 16, 2026

- Add uppercase quantization suffixes (_Q4_K_M, etc.) to handle cached GGUF filenames
- Add progressive fallback to try shorter model names when tokenizer loading fails
- Example: Qwen3.5-27B-Uncensored-HauhauCS-Aggressive -> try Qwen3.5-27B-Uncensored -> Qwen3.5-27B -> Qwen3.5 -> Qwen
- Add warning when all tokenizer loading attempts fail (will use manual formatting instead)

3479b3f0
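
Progressively shorter name candidates can be generated by dropping one hyphen-separated segment at a time and finishing with the bare family name. This is a sketch with a hypothetical function name; it produces one more intermediate candidate than the example in the commit message, which skips a step.

```python
import re


def candidate_model_names(name: str) -> list[str]:
    """List tokenizer names to try, from the full name down to the family name."""
    parts = name.split("-")
    # Drop trailing segments one at a time: a-b-c, a-b, a
    candidates = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    # Finally try the leading letters only, e.g. "Qwen" from "Qwen3.5".
    family = re.match(r"[A-Za-z]+", parts[0])
    if family and family.group(0) not in candidates:
        candidates.append(family.group(0))
    return candidates


names = candidate_model_names("Qwen3.5-27B-Uncensored-HauhauCS-Aggressive")
```

A caller would try each candidate in order and stop at the first tokenizer that loads, warning and falling back to manual formatting if none does.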

Fix: Hash prefix is 64 chars (SHA-256), add fallback for model_backend_types · 43cb91d5

Your Name authored Mar 16, 2026

43cb91d5

Fix: Remove hash prefix from cached GGUF filenames properly · 2da217a1

Your Name authored Mar 16, 2026

2da217a1

Fix: Remove hash prefix from cached GGUF filenames when extracting model name · 34648b9b

Your Name authored Mar 16, 2026

34648b9b

Fix HF tokenizer loading to check for cached local file first when model is URL · 13e81d0d

Your Name authored Mar 16, 2026

13e81d0d

Fix: Add _aggressive_vram_cleanup to MultiModelManager class · b4d3d43b

Your Name authored Mar 16, 2026

b4d3d43b

Reduce VRAM cleanup delay to 2 seconds · 6f42fbde

Your Name authored Mar 16, 2026

6f42fbde

Add aggressive VRAM cleanup for model switching · 7c150a4d

Your Name authored Mar 16, 2026

- Added _aggressive_vram_cleanup method to properly clear VRAM
- Moves model to CPU before deletion
- Deletes pipeline, vae, text_encoder, tokenizer explicitly
- Multiple rounds of gc.collect()
- Uses torch.cuda.synchronize() before clearing cache
- Increased delay to 5 seconds after cleanup

7c150a4d
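
The cleanup sequence described above could be sketched as follows. This is not the repository's `_aggressive_vram_cleanup`: the standalone function, its `holder` argument, and the optional `torch` import (so the sketch also runs without PyTorch installed) are assumptions.

```python
import gc
import time

try:
    import torch
except ImportError:  # allow the sketch to run without PyTorch
    torch = None


def aggressive_vram_cleanup(
    holder,
    attrs=("pipeline", "vae", "text_encoder", "tokenizer"),
    gc_rounds=3,
    delay_seconds=5.0,
):
    """Release model components held on `holder` and clear cached GPU memory."""
    released = []
    for attr in attrs:
        obj = getattr(holder, attr, None)
        if obj is None:
            continue
        if hasattr(obj, "to"):
            try:
                obj.to("cpu")  # move weights off the GPU before dropping them
            except Exception:
                pass
        setattr(holder, attr, None)
        released.append(attr)
    obj = None  # drop the last local reference as well

    for _ in range(gc_rounds):
        gc.collect()

    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for pending kernels before clearing
        torch.cuda.empty_cache()

    time.sleep(delay_seconds)  # give the driver time to reclaim memory
    return released
```

Returning the list of released attribute names is a convenience for logging; the delay default of 5 seconds matches the value in this commit.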

Improve --hf-chat-template help text · 804dac03

Your Name authored Mar 16, 2026

804dac03

Add support for specifying chat template in --hf-chat-template · 6e794ae6

Your Name authored Mar 16, 2026

- The template can now be specified directly: --hf-chat-template "model:template"
- Updated check_hf_chat_template to return tuple (should_use, template_name)
- Updated _load_huggingface_tokenizer to accept template_name parameter
- Updated README with new syntax and template examples

6e794ae6
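
A check that returns the `(should_use, template_name)` tuple described above might work like this. The signature is hypothetical, and the sketch assumes plain model names without colons (a URL would need different splitting). It also accepts the `auto` value introduced in the earlier commit e17bc553.

```python
from typing import Optional, Sequence, Tuple


def check_hf_chat_template(
    model_name: str, specs: Optional[Sequence[str]]
) -> Tuple[bool, Optional[str]]:
    """Return (should_use, template_name) for a model (hypothetical signature)."""
    specs = list(specs or [])
    # An explicit entry for this model takes priority over "auto".
    for spec in specs:
        target, _, template = spec.partition(":")
        if target == model_name:
            return True, template or None
    if "auto" in specs:
        return True, None  # use the HF template, let it be auto-detected
    return False, None


decision = check_hf_chat_template("qwen3-8b", ["auto", "qwen3-8b:qwen3"])
```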

Add auto-detect support for --hf-chat-template · e17bc553

Your Name authored Mar 16, 2026

- Added 'auto' as a valid value for --hf-chat-template
- With --hf-chat-template auto, the HF template is auto-detected and applied to all models
- Updated README with new syntax

e17bc553

Add debug output showing raw vs escaped content · 079bd8dc

Your Name authored Mar 16, 2026

079bd8dc

15 Mar, 2026 5 commits

Make --hf-chat-template repeatable per model · 31b6480e

Your Name authored Mar 15, 2026

- Changed --hf-chat-template from boolean to action=append
- Added check_hf_chat_template() function for model-specific checking
- Updated _finalize_chat_template_detection to use new function
- Updated README with new syntax

31b6480e
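
Switching a flag from boolean to repeatable is done in argparse with `action="append"`. The snippet below is a minimal sketch; the `metavar` and help text are assumptions.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--hf-chat-template",
    action="append",  # each occurrence adds one entry to the list
    default=[],
    metavar="MODEL",
    help="Model to apply the HF chat template to; may be repeated.",
)

args = parser.parse_args(
    ["--hf-chat-template", "qwen3-8b", "--hf-chat-template", "llama-3.1"]
)
selected_models = args.hf_chat_template
```

argparse converts the dashes in the option name to underscores, so the collected values are read from `args.hf_chat_template`.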

Add --hf-chat-template option for HuggingFace apply_chat_template · 3121fb85

Your Name authored Mar 15, 2026

- Added --hf-chat-template CLI flag to use transformers apply_chat_template
- Added _load_huggingface_tokenizer() to load HF tokenizer for GGUF models
- Added _format_messages_hf() method for HF chat template formatting
- Updated generate_chat and generate_chat_stream to use HF tokenizer when available
- Updated format_messages to check for HF tokenizer first
- Added documentation in README.md

3121fb85

Add --reply-filters option for optional content filtering · 533f8fd5

Your Name authored Mar 15, 2026

- Added --reply-filters CLI flag to make content filtering optional
- Supports comma-separated values: --reply-filters malformed,tool_calls
- Supports model-specific filters: --reply-filters text:malformed --reply-filters image:tool_calls
- Supports specific model names: --reply-filters text:llama-3.1:malformed
- Added check_reply_filter() and check_single_filter() helper functions
- Updated stream_chat_response and generate_chat_response to use new filtering
- Updated ToolCallParser._filter_malformed_content for conditional filtering
- Added documentation in README.md

533f8fd5
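
The three `--reply-filters` forms above (global, per model type, per model name) could be evaluated as in this sketch. The function names match the commit, but the signatures are hypothetical, and two behaviors are assumptions: no filter applies when the flag is absent, and model names are matched as case-insensitive substrings.

```python
from typing import Optional, Sequence


def check_single_filter(spec: str, filter_name: str,
                        model_type: str, model_name: str) -> bool:
    """Check one spec: 'f1,f2', 'type:f1,f2' or 'type:model:f1,f2'."""
    parts = spec.split(":")
    filters = [f.strip() for f in parts[-1].split(",")]
    if filter_name not in filters:
        return False
    scope = parts[:-1]  # optional [model_type] or [model_type, model_name]
    if len(scope) >= 1 and scope[0] != model_type:
        return False
    if len(scope) >= 2 and scope[1].lower() not in model_name.lower():
        return False
    return True


def check_reply_filter(specs: Optional[Sequence[str]], filter_name: str,
                       model_type: str, model_name: str) -> bool:
    """Return True if any spec enables `filter_name` for this model."""
    return any(
        check_single_filter(spec, filter_name, model_type, model_name)
        for spec in specs or []
    )


enabled = check_reply_filter(
    ["text:llama-3.1:malformed"], "malformed", "text", "llama-3.1-8b"
)
```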

Increase VRAM cleanup delay to 3 seconds · f8d2481e

Your Name authored Mar 15, 2026

- Give more time for Vulkan memory to be freed after unloading image models

f8d2481e

Fix: Force VRAM cleanup when switching from image to text model · a4a8c340

Your Name authored Mar 15, 2026

- Add garbage collection and torch.cuda.empty_cache() after unloading image models
- Add a small delay to allow VRAM to be freed before loading new model
- This should help prevent OOM errors when switching between image and text models

a4a8c340