- 17 Mar, 2026 40 commits
-
-
Your Name authored
- System prompt now includes: 'CRITICAL: You must always close your reasoning with ]]> before opening any tool tags' - Extraction logic now uses tool tags as fallback stop markers if close tag is missing - Handles: <tool_call>, <tool>, <|tool_call|>, <|tool|>, <function=
-
Your Name authored
Shows: - Full first pass result - Extraction details (close tag used, reasoning text, final text) - Cleanup details
-
Your Name authored
Now raw mode passes the generated text through OpenAIFormatter which: - Handles tool extraction - Provides OpenAI compatibility - Handles other post-processing This ensures raw mode output is treated the same as regular mode.
-
Your Name authored
- Add cleanup_control_tokens function to strip leading/trailing control tokens - Apply cleanup to final_text and second_pass_result in raw mode - Add mock strategy handling to raw mode (was missing) - Add debug output for cleanup steps
-
Your Name authored
When 'raw' is used without 'prompt', template_manager wasn't defined. Now creating it on-demand when needed.
-
Your Name authored
When 'raw' is used, skip the 'prompt', 'inject', and 'stop' handlers since raw mode handles everything separately. This was causing double assistant headers and corrupted prompts.
-
Your Name authored
The tokenizer approach was causing double assistant headers. Now using only template_manager.format_for_raw_completion which handles everything correctly.
-
Your Name authored
The AgenticTemplateManager already has a format_for_raw_completion method that handles prompt formatting with reasoning tags. No need to manually find the tokenizer - just use the existing template logic.
-
Your Name authored
Now shows: - current_manager type and backend type - Available attributes on the backend - Which path was used to find (or not find) the tokenizer - Also checks model_manager.tokenizer as fallback
-
Your Name authored
Fixed issue where raw mode variables were being re-initialized, which was overwriting the values set in the prompt handling section.
-
Your Name authored
- Added 'raw' to valid force-reasoning options (chat, stop, inject, prompt, twopass, mock, raw) - Implemented raw mode handler that: - Uses tokenizer.apply_chat_template() with add_generation_prompt=True - Seeds reasoning tag + commitment sentence - Uses two-pass generation: first captures reasoning, then gets final answer - Supports both streaming and non-streaming responses - Falls back gracefully if tokenizer not available This enables using the model's native tokenizer for prompt seeding, bypassing double-templating issues with chat APIs.
-
Your Name authored
-
Your Name authored
All Big 10 families now end with '<minimax:tool_call> ' without trailing space
-
Your Name authored
Replace messages with seeded prompt for raw completion
-
Your Name authored
Now ends with ']~b] ' instead of ']~b]'
-
Your Name authored
-
Your Name authored
When both inject and prompt are selected, use the same reasoning tag (]]) for consistency instead of <|thought|>
-
Your Name authored
When --system-prompt is specified, it now prepends to any existing system message instead of replacing it.
-
Your Name authored
- Add ]]> to stop sequences when using 'prompt' option - Add 'mock' strategy to add fake reasoning stats for VSCode plugin - Add 'twopass' option (not yet implemented)
-
Your Name authored
Shows: - Raw model output - Parsed output (after formatter) - Litellm debug info (via --debug)
-
Your Name authored
Use --force-reasoning all to enable chat, stop, inject, and prompt
-
Your Name authored
New options for --force-reasoning: - chat: Enable thinking API parameter - stop: Add reasoning stop tokens - inject: System prompt injection (includes stop) - prompt: Prompt seeding with thought tag (includes stop) Can combine: --force-reasoning chat,inject,prompt Also added force_reasoning_prompt() to templates.py for prompt seeding.
-
Your Name authored
- Add selectable parameters to format_for_raw_completion() - inject_system: toggle agentic system prompt injection - force_reasoning: toggle prompt seeding (thought tag) - Update create_reasoning_prompt() convenience function
-
Your Name authored
- Add REASONING_PREFIXES for Big 10 model families (Qwen, Llama3, DeepSeek, etc.) - Add REASONING_STOP_TOKENS for stopping reasoning generation - Add force_reasoning_prompt() to construct prompts ending with thought tags - Add extract_reasoning() to parse reasoning from responses - Add format_for_raw_completion() and create_reasoning_prompt() convenience functions - This enables 'token hijacking' to force models to start with reasoning
-
Your Name authored
- Enhanced flash attention status output in NvidiaBackend to always show availability - Added debug output in chat completions endpoint for force-reasoning mode - Shows CLI flag value, API param, reasoning action, and whether injection was done - Displays the actual injected system prompt content when debug mode is enabled
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
-
Your Name authored
Now accepts positional args: max_tokens, temperature, top_p, stop
-
Your Name authored
Convert ChatMessage objects to dicts before applying chat template.
-
Your Name authored
-
Your Name authored
-
Your Name authored
Now detects and uses the built-in chat template from GGUF files loaded via llama-cpp-python before falling back to manual formatting.
-
Your Name authored
-
Your Name authored
Now detects GGUF model repos (e.g., unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF) and lists available GGUF files before downloading. Prefers Q4_K_M or Q4_K quantizations when available.
-
Your Name authored
Fixed load_model and generate to be non-async methods (matching base class): - load_model: changed from async def returning bool to def returning None - generate: changed from async def to def (removed streaming support in sync version) - Removed 'stream' parameter from generate since it's now sync - chat: changed from async def to def - generate_stream remains async def (correct for streaming)
-
Your Name authored
Added: - get_model_name() - format_messages() - cleanup() These were required by the ModelBackend abstract base class.
-
Your Name authored
Removed ~2050 lines of duplicate code: - Pydantic models (ToolFunction, Tool, ChatMessage, etc.) - now from codai.pydantic - ModelParserAdapter, ToolCallParser - now from codai.models - NvidiaBackend, VulkanBackend - now from codai.backends - All other duplicates removed Now coderai properly imports all classes from codai modules.
-