- 17 Mar, 2026 26 commits
-
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
Now accepts positional args: max_tokens, temperature, top_p, stop
-
Your Name authored
Convert ChatMessage objects to dicts before applying chat template.
-
Your Name authored
-
Your Name authored
-
Your Name authored
Now detects and uses the built-in chat template from GGUF files loaded via llama-cpp-python before falling back to manual formatting.
-
Your Name authored
-
Your Name authored
Now detects GGUF model repos (e.g., unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF) and lists available GGUF files before downloading. Prefers Q4_K_M or Q4_K quantizations when available.
-
Your Name authored
Fixed load_model and generate to be non-async methods (matching base class): - load_model: changed from async def returning bool to def returning None - generate: changed from async def to def (removed streaming support in sync version) - Removed 'stream' parameter from generate since it's now sync - chat: changed from async def to def - generate_stream remains async def (correct for streaming)
-
Your Name authored
Added: - get_model_name() - format_messages() - cleanup() These were required by the ModelBackend abstract base class.
-
Your Name authored
Removed ~2050 lines of duplicate code: - Pydantic models (ToolFunction, Tool, ChatMessage, etc.) - now from codai.pydantic - ModelParserAdapter, ToolCallParser - now from codai.models - NvidiaBackend, VulkanBackend - now from codai.backends - All other duplicates removed Now coderai properly imports all classes from codai modules.
-
Your Name authored
Removed ~1500 lines of duplicate code that now exist in codai modules: - ModelCapabilities, detect_model_capabilities (now in codai.models.capabilities) - Cache functions (now in codai.models.cache) - detect_available_backends, check_flash_attn_availability (now in codai.backends) - ModelBackend abstract class (now in codai.backends.base) - ModelManager, WhisperServerManager, MultiModelManager (now in codai.models.manager) - QueueManager (now in codai.queue.manager) - Utility functions (now in codai.models.utils) The code now properly imports from codai modules instead of having inline duplicates.
-
Your Name authored
- Added complete check_hf_chat_template with global_args support - Added complete get_resolved_model_name - Added complete get_model_family with more model families - Added complete get_reasoning_stop_tokens for more model families - Added complete get_reasoning_system_prompt - Added set_global_args and get_global_args for configuration
-
Your Name authored
-
Your Name authored
- Move NvidiaBackend to codai/backends/cuda.py - Move VulkanBackend to codai/backends/vulkan.py - Move ModelManager, WhisperServerManager, MultiModelManager to codai/models/manager.py - Move QueueManager to codai/queue/manager.py - Add proper exports in codai/backends/__init__.py - Update imports in coderai to use new modules - Fix import paths for base class and cache functions
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
- 16 Mar, 2026 14 commits
-
-
Your Name authored
-
Your Name authored
- Added --force-reasoning with choices: 'stop', 'inject', 'both' (default) - Add model-family detection for reasoning stop tokens - Get appropriate stop tokens for Qwen, DeepSeek, Llama3, Mistral, Gemma, Hermes/Yi - Add system prompt injection for forcing reasoning on non-native models - Add extract_reasoning_content() function to parsers for extracting thinking tags
-
Your Name authored
- Added --force-reasoning argument to enable reasoning mode for models that support it (Qwen3, DeepSeek R1, etc.) - Modified chat_completions endpoint to check both API parameter enable_thinking and CLI flag force_reasoning - When either is true, injects agentic template to enable thinking
-
Your Name authored
- Add enable_thinking parameter to ChatCompletionRequest - When enable_thinking=True, inject agentic system prompt to force thinking/reasoning - Uses AgenticTemplateManager to inject thought tags for supported models
-
Your Name authored
-
Your Name authored
- Add --force-reasoning CLI flag to force thinking mode for models like qwen3 coder - Add check_force_reasoning() function to determine if reasoning should be forced - Modify QwenParser to extract thinking/reasoning content instead of stripping it - Add reasoning field to response message in non-streaming chat completions - Prepend reasoning content to generated text in streaming responses - Update OpenAIFormatter to include reasoning in response when available
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
- Simplify OpenAIFormatter by using litellm's ModelResponse and ChatCompletionChunk directly - Add fallback support for when litellm is not available or fails - Maintain compatibility with existing API - Remove redundant format_litellm_full and format_litellm_chunk methods
-
Your Name authored
The issue was caused by importing StreamingResponse and JSONResponse inside the chat_completions function. In Python, when you have an import statement anywhere inside a function, it creates a local variable for that name throughout the entire function scope. This caused the code in the original implementation path to fail because Python saw StreamingResponse as an unassigned local variable. Fix: Move StreamingResponse and JSONResponse imports to module level and remove redundant imports from inside the function.
-
Your Name authored
-
Your Name authored
The litellm library doesn't export Delta, Choices, etc. directly. Rewrote the formatter to build response dictionaries directly.
-
Your Name authored
-