- 17 Mar, 2026 31 commits
-
-
Your Name authored
- Add selectable parameters to format_for_raw_completion() - inject_system: toggle agentic system prompt injection - force_reasoning: toggle prompt seeding (thought tag) - Update create_reasoning_prompt() convenience function
-
Your Name authored
- Add REASONING_PREFIXES for Big 10 model families (Qwen, Llama3, DeepSeek, etc.) - Add REASONING_STOP_TOKENS for stopping reasoning generation - Add force_reasoning_prompt() to construct prompts ending with thought tags - Add extract_reasoning() to parse reasoning from responses - Add format_for_raw_completion() and create_reasoning_prompt() convenience functions - This enables 'token hijacking' to force models to start with reasoning
-
Your Name authored
- Enhanced flash attention status output in NvidiaBackend to always show availability - Added debug output in chat completions endpoint for force-reasoning mode - Shows CLI flag value, API param, reasoning action, and whether injection was done - Displays the actual injected system prompt content when debug mode is enabled
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
Now accepts positional args: max_tokens, temperature, top_p, stop
-
Your Name authored
Convert ChatMessage objects to dicts before applying chat template.
-
Your Name authored
-
Your Name authored
-
Your Name authored
Now detects and uses the built-in chat template from GGUF files loaded via llama-cpp-python before falling back to manual formatting.
-
Your Name authored
-
Your Name authored
Now detects GGUF model repos (e.g., unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF) and lists available GGUF files before downloading. Prefers Q4_K_M or Q4_K quantizations when available.
-
Your Name authored
Fixed load_model and generate to be non-async methods (matching base class): - load_model: changed from async def returning bool to def returning None - generate: changed from async def to def (removed streaming support in sync version) - Removed 'stream' parameter from generate since it's now sync - chat: changed from async def to def - generate_stream remains async def (correct for streaming)
-
Your Name authored
Added: - get_model_name() - format_messages() - cleanup() These were required by the ModelBackend abstract base class.
-
Your Name authored
Removed ~2050 lines of duplicate code: - Pydantic models (ToolFunction, Tool, ChatMessage, etc.) - now from codai.pydantic - ModelParserAdapter, ToolCallParser - now from codai.models - NvidiaBackend, VulkanBackend - now from codai.backends - All other duplicates removed Now coderai properly imports all classes from codai modules.
-
Your Name authored
Removed ~1500 lines of duplicate code that now exist in codai modules: - ModelCapabilities, detect_model_capabilities (now in codai.models.capabilities) - Cache functions (now in codai.models.cache) - detect_available_backends, check_flash_attn_availability (now in codai.backends) - ModelBackend abstract class (now in codai.backends.base) - ModelManager, WhisperServerManager, MultiModelManager (now in codai.models.manager) - QueueManager (now in codai.queue.manager) - Utility functions (now in codai.models.utils) The code now properly imports from codai modules instead of having inline duplicates.
-
Your Name authored
- Added complete check_hf_chat_template with global_args support - Added complete get_resolved_model_name - Added complete get_model_family with more model families - Added complete get_reasoning_stop_tokens for more model families - Added complete get_reasoning_system_prompt - Added set_global_args and get_global_args for configuration
-
Your Name authored
-
Your Name authored
- Move NvidiaBackend to codai/backends/cuda.py - Move VulkanBackend to codai/backends/vulkan.py - Move ModelManager, WhisperServerManager, MultiModelManager to codai/models/manager.py - Move QueueManager to codai/queue/manager.py - Add proper exports in codai/backends/__init__.py - Update imports in coderai to use new modules - Fix import paths for base class and cache functions
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
- 16 Mar, 2026 9 commits
-
-
Your Name authored
-
Your Name authored
- Added --force-reasoning with choices: 'stop', 'inject', 'both' (default) - Add model-family detection for reasoning stop tokens - Get appropriate stop tokens for Qwen, DeepSeek, Llama3, Mistral, Gemma, Hermes/Yi - Add system prompt injection for forcing reasoning on non-native models - Add extract_reasoning_content() function to parsers for extracting thinking tags
-
Your Name authored
- Added --force-reasoning argument to enable reasoning mode for models that support it (Qwen3, DeepSeek R1, etc.) - Modified chat_completions endpoint to check both API parameter enable_thinking and CLI flag force_reasoning - When either is true, injects agentic template to enable thinking
-
Your Name authored
- Add enable_thinking parameter to ChatCompletionRequest - When enable_thinking=True, inject agentic system prompt to force thinking/reasoning - Uses AgenticTemplateManager to inject thought tags for supported models
-
Your Name authored
-
Your Name authored
- Add --force-reasoning CLI flag to force thinking mode for models like qwen3 coder - Add check_force_reasoning() function to determine if reasoning should be forced - Modify QwenParser to extract thinking/reasoning content instead of stripping it - Add reasoning field to response message in non-streaming chat completions - Prepend reasoning content to generated text in streaming responses - Update OpenAIFormatter to include reasoning in response when available
-
Your Name authored
-
Your Name authored
-
Your Name authored
-