- 17 Mar, 2026 38 commits
-
-
Your Name authored
-
Your Name authored
When both inject and prompt are selected, use the same reasoning tag (]]) for consistency instead of <|thought|>
-
Your Name authored
When --system-prompt is specified, it now prepends to any existing system message instead of replacing it.
-
Your Name authored
- Add ]]> to stop sequences when using 'prompt' option - Add 'mock' strategy to add fake reasoning stats for VSCode plugin - Add 'twopass' option (not yet implemented)
-
Your Name authored
Shows: - Raw model output - Parsed output (after formatter) - Litellm debug info (via --debug)
-
Your Name authored
Use --force-reasoning all to enable chat, stop, inject, and prompt
-
Your Name authored
New options for --force-reasoning: - chat: Enable thinking API parameter - stop: Add reasoning stop tokens - inject: System prompt injection (includes stop) - prompt: Prompt seeding with thought tag (includes stop) Can combine: --force-reasoning chat,inject,prompt Also added force_reasoning_prompt() to templates.py for prompt seeding.
-
Your Name authored
- Add selectable parameters to format_for_raw_completion() - inject_system: toggle agentic system prompt injection - force_reasoning: toggle prompt seeding (thought tag) - Update create_reasoning_prompt() convenience function
-
Your Name authored
- Add REASONING_PREFIXES for Big 10 model families (Qwen, Llama3, DeepSeek, etc.) - Add REASONING_STOP_TOKENS for stopping reasoning generation - Add force_reasoning_prompt() to construct prompts ending with thought tags - Add extract_reasoning() to parse reasoning from responses - Add format_for_raw_completion() and create_reasoning_prompt() convenience functions - This enables 'token hijacking' to force models to start with reasoning
-
Your Name authored
- Enhanced flash attention status output in NvidiaBackend to always show availability - Added debug output in chat completions endpoint for force-reasoning mode - Shows CLI flag value, API param, reasoning action, and whether injection was done - Displays the actual injected system prompt content when debug mode is enabled
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
Now accepts positional args: max_tokens, temperature, top_p, stop
-
Your Name authored
Convert ChatMessage objects to dicts before applying chat template.
-
Your Name authored
-
Your Name authored
-
Your Name authored
Now detects and uses the built-in chat template from GGUF files loaded via llama-cpp-python before falling back to manual formatting.
-
Your Name authored
-
Your Name authored
Now detects GGUF model repos (e.g., unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF) and lists available GGUF files before downloading. Prefers Q4_K_M or Q4_K quantizations when available.
-
Your Name authored
Fixed load_model and generate to be non-async methods (matching base class): - load_model: changed from async def returning bool to def returning None - generate: changed from async def to def (removed streaming support in sync version) - Removed 'stream' parameter from generate since it's now sync - chat: changed from async def to def - generate_stream remains async def (correct for streaming)
-
Your Name authored
Added: - get_model_name() - format_messages() - cleanup() These were required by the ModelBackend abstract base class.
-
Your Name authored
Removed ~2050 lines of duplicate code: - Pydantic models (ToolFunction, Tool, ChatMessage, etc.) - now from codai.pydantic - ModelParserAdapter, ToolCallParser - now from codai.models - NvidiaBackend, VulkanBackend - now from codai.backends - All other duplicates removed Now coderai properly imports all classes from codai modules.
-
Your Name authored
Removed ~1500 lines of duplicate code that now exist in codai modules: - ModelCapabilities, detect_model_capabilities (now in codai.models.capabilities) - Cache functions (now in codai.models.cache) - detect_available_backends, check_flash_attn_availability (now in codai.backends) - ModelBackend abstract class (now in codai.backends.base) - ModelManager, WhisperServerManager, MultiModelManager (now in codai.models.manager) - QueueManager (now in codai.queue.manager) - Utility functions (now in codai.models.utils) The code now properly imports from codai modules instead of having inline duplicates.
-
Your Name authored
- Added complete check_hf_chat_template with global_args support - Added complete get_resolved_model_name - Added complete get_model_family with more model families - Added complete get_reasoning_stop_tokens for more model families - Added complete get_reasoning_system_prompt - Added set_global_args and get_global_args for configuration
-
Your Name authored
-
Your Name authored
- Move NvidiaBackend to codai/backends/cuda.py - Move VulkanBackend to codai/backends/vulkan.py - Move ModelManager, WhisperServerManager, MultiModelManager to codai/models/manager.py - Move QueueManager to codai/queue/manager.py - Add proper exports in codai/backends/__init__.py - Update imports in coderai to use new modules - Fix import paths for base class and cache functions
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
- 16 Mar, 2026 2 commits
-
-
Your Name authored
-
Your Name authored
- Added --force-reasoning with choices: 'stop', 'inject', 'both' (default) - Add model-family detection for reasoning stop tokens - Get appropriate stop tokens for Qwen, DeepSeek, Llama3, Mistral, Gemma, Hermes/Yi - Add system prompt injection for forcing reasoning on non-native models - Add extract_reasoning_content() function to parsers for extracting thinking tags
-