Multimodal capabilities

parent e1bca2d8
...@@ -672,3 +672,20 @@ may consider it more useful to permit linking proprietary applications with ...@@ -672,3 +672,20 @@ may consider it more useful to permit linking proprietary applications with
the library. If this is what you want to do, use the GNU Lesser General the library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License. But first, please read Public License instead of this License. But first, please read
<https://www.gnu.org/licenses/why-not-lgpl.html>. <https://www.gnu.org/licenses/why-not-lgpl.html>.
---
Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
# Multimodal Model Capability Indicators - Implementation Summary
## Overview
Added comprehensive multimodal capability detection and display throughout CoderAI's UI, making it easy to identify models that support multiple modalities (text, image, video, audio) before downloading and when browsing the local cache.
## Changes Made
### 1. Enhanced Capability Detection (`codai/models/capabilities.py`)
- **Updated `detect_model_capabilities()`** to return multiple capabilities for multimodal models
- Models now correctly show all their capabilities instead of just one
- Examples:
- Stable Diffusion: `text_generation`, `image_generation`, `image_to_image`, `inpainting`
- LLaVA: `text_generation`, `image_to_text` (vision LLM)
- CogVideoX: `text_generation`, `video_generation` (T2V)
- MusicGen: `text_generation`, `audio_generation` (T2A)
- Whisper: `speech_to_text`, `subtitle_generation` (STT)
### 2. Backend API Updates (`codai/admin/routes.py`)
#### `_scan_caches()` function
- Added capability detection for all cached models (both HuggingFace and GGUF)
- Each model entry now includes a `capabilities` array
- Capabilities are detected from model name/ID using heuristics
#### `api_hf_search()` endpoint
- Added capability detection to search results
- Each search result now includes detected capabilities
- Enables filtering and display of multimodal features
### 3. Web UI Enhancements (`codai/admin/templates/models.html`)
#### Search Interface
- **New capability filter chips** for multimodal search:
- Text, T2I (text-to-image), I2T (image-to-text)
- T2V (text-to-video), I2V (image-to-video)
- T2A (text-to-audio), STT (speech-to-text), TTS (text-to-speech)
- Embeddings
- Plus existing filters (tool calling, vision, reasoning, code, etc.)
- **Capability badges in search results**: Each model shows up to 5 capability badges
- **Client-side filtering**: Filter search results by detected capabilities
#### Local Models View
- **HuggingFace models table**: New "Capabilities" column showing model capabilities
- **GGUF files table**: New "Capabilities" column showing model capabilities
- **Capability badges**: Compact, color-coded badges for quick identification
#### Helper Functions
- `fmtCapabilities()`: Formats capability arrays into compact badge HTML
- Supports 20+ capability types with short labels (T2I, I2T, T2V, etc.)
### 4. Chat Interface (`codai/admin/templates/chat.html`)
- **Multimodal indicators in sidebar**: Models with multiple capabilities show a compact indicator (e.g., "T+I+V" for text+image+video)
- Helps users quickly identify multimodal models when selecting
## Capability Types Supported
### Text & Language
- `text_generation` - LLM chat/completion
- `embeddings` - Text/image embeddings
### Image
- `image_generation` - Text-to-image (Stable Diffusion, FLUX, DALL-E)
- `image_to_image` - Image-to-image transformation
- `image_to_text` - Vision models, VQA, captioning
- `inpainting` - Inpaint with mask
- `controlnet` - ControlNet-guided generation
- `depth_estimation` - Monocular depth estimation
- `image_segmentation` - SAM, Mask R-CNN
- `image_upscaling` - ESRGAN, SwinIR
- `face_restoration` - CodeFormer, GFPGAN
- `object_detection` - YOLO, DETR
### Video
- `video_generation` - Text-to-video (CogVideoX, LTX)
- `image_to_video` - Image-to-video (SVD, I2VGen)
- `video_to_video` - Video style transfer
- `video_interpolation` - Frame interpolation (FILM, RIFE)
- `video_upscaling` - Video super-resolution
### Audio
- `speech_to_text` - Whisper transcription
- `text_to_speech` - Kokoro, Bark, XTTS
- `subtitle_generation` - WhisperX / forced alignment
- `audio_generation` - MusicGen, AudioLDM2
- `audio_to_audio` - Denoising, source separation
### Advanced
- `lip_sync` - Wav2Lip, SadTalker
- `video_dubbing` - Translation + TTS + lip sync
## Usage Examples
### Searching for Multimodal Models
1. Go to **Models****Find on HuggingFace** tab
2. Use capability chips to filter:
- Click "T2I" to find text-to-image models
- Click "I2T" to find vision/VLM models
- Click "T2V" to find text-to-video models
- Combine multiple chips for AND filtering
### Identifying Multimodal Models
- **Before download**: Search results show capability badges
- **In local cache**: Both HF and GGUF tables show capabilities
- **In chat**: Sidebar shows compact multimodal indicators
### Example Models
- **Stable Diffusion XL**: Shows `Text`, `T2I`, `I2I`, `Inpaint` badges
- **LLaVA-1.5**: Shows `Text`, `I2T` badges (vision LLM)
- **CogVideoX**: Shows `Text`, `T2V` badges
- **Whisper**: Shows `STT`, `Subs` badges
## Technical Details
### Detection Logic
- Heuristic-based detection from model name/ID
- Checks for known model families and keywords
- Returns all applicable capabilities (not just primary)
- Fallback to `text_generation` for unknown models
### Performance
- Capability detection runs on-demand (search, cache scan)
- Minimal overhead (~1ms per model)
- Results cached in API responses
### Extensibility
- Easy to add new capability types in `ModelCapabilities` dataclass
- Add detection patterns in `detect_model_capabilities()`
- Update UI labels in `fmtCapabilities()` helper
## Testing
All capability detection tests pass:
- ✓ Stable Diffusion (multimodal: text + image)
- ✓ LLaVA (multimodal: text + vision)
- ✓ CogVideoX (multimodal: text + video)
- ✓ Whisper (audio: STT + subtitles)
- ✓ MusicGen (multimodal: text + audio)
- ✓ GGUF text models (single: text only)
## Future Enhancements
- Add capability-based model recommendations
- Show capability compatibility warnings (e.g., "This model requires vision input")
- Add capability-based sorting in search results
- Support user-defined capability tags
# Multimodal Capability Indicators - UI Examples
## Search Results (HuggingFace)
### Before
```
stable-diffusion-xl-base-1.0
text-to-image ↓ 2.5M ♥ 15k
[Info] [▾ Files] [Download]
```
### After
```
stable-diffusion-xl-base-1.0
text-to-image [Text] [T2I] [I2I] [Inpaint] ↓ 2.5M ♥ 15k
[Info] [▾ Files] [Download]
```
## Local Models (HuggingFace Cache)
### Before
| Model | Size | Files | Config | Actions |
|-------|------|-------|--------|---------|
| meta-llama/Llama-2-7b-chat-hf | 13.5 GB | 42 | enabled | [Load now] [Configure] [Remove] [Delete] |
### After
| Model | Size | Files | Capabilities | Config | Actions |
|-------|------|-------|--------------|--------|---------|
| meta-llama/Llama-2-7b-chat-hf | 13.5 GB | 42 | [Text] | enabled | [Load now] [Configure] [Remove] [Delete] |
| stabilityai/stable-diffusion-xl-base-1.0 | 6.9 GB | 28 | [Text] [T2I] [I2I] [Inpaint] | enabled | [Load now] [Configure] [Remove] [Delete] |
| llava-hf/llava-v1.5-7b-hf | 13.1 GB | 35 | [Text] [I2T] | enabled | [Load now] [Configure] [Remove] [Delete] |
## Local Models (GGUF Cache)
### Before
| File | Size | Config | Actions |
|------|------|--------|---------|
| llama-2-7b-chat.Q4_K_M.gguf | 4.1 GB | enabled | [Load now] [Configure] [Remove] [Delete] |
### After
| File | Size | Capabilities | Config | Actions |
|------|------|--------------|--------|---------|
| llama-2-7b-chat.Q4_K_M.gguf | 4.1 GB | [Text] | enabled | [Load now] [Configure] [Remove] [Delete] |
| stable-diffusion-xl.Q4_K_M.gguf | 3.8 GB | [Text] [T2I] [I2I] | enabled | [Load now] [Configure] [Remove] [Delete] |
## Chat Sidebar
### Before
```
[LLM] llama-2-7b-chat
[IMG] stable-diffusion-xl
[VLM] llava-v1.5-7b
```
### After
```
[LLM] llama-2-7b-chat
[IMG] stable-diffusion-xl T+I+I
[VLM] llava-v1.5-7b T+V
```
## Search Filters
### New Capability Chips (in addition to existing filters)
```
Cap: [Text] [T2I] [I2T] [T2V] [I2V] [T2A] [STT] [TTS] [Embed] [Tool calling] [Vision] [Reasoning] [Code] [Multilingual] [Roleplay] [Math]
```
### Usage
- Click chips to filter models by capability
- Multiple chips = AND filter (model must have all selected capabilities)
- Works with existing filters (size, quant, pipeline, etc.)
## Capability Badge Legend
| Badge | Full Name | Description |
|-------|-----------|-------------|
| Text | Text Generation | LLM chat/completion |
| T2I | Text-to-Image | Generate images from text |
| I2T | Image-to-Text | Vision models, VQA, captioning |
| I2I | Image-to-Image | Transform/edit images |
| T2V | Text-to-Video | Generate videos from text |
| I2V | Image-to-Video | Animate images into videos |
| V2V | Video-to-Video | Transform/edit videos |
| T2A | Text-to-Audio | Generate music/audio from text |
| A2A | Audio-to-Audio | Transform/edit audio |
| STT | Speech-to-Text | Transcribe audio to text |
| TTS | Text-to-Speech | Synthesize speech from text |
| Embed | Embeddings | Generate text/image embeddings |
| Inpaint | Inpainting | Fill masked regions in images |
| ControlNet | ControlNet | Guided image generation |
| Depth | Depth Estimation | Estimate depth from images |
| Segment | Image Segmentation | Segment objects in images |
| Upscale | Image Upscaling | Enhance image resolution |
| Face | Face Restoration | Restore/enhance faces |
| Detect | Object Detection | Detect objects in images |
| Interp | Video Interpolation | Generate intermediate frames |
| V-Upscale | Video Upscaling | Enhance video resolution |
| Lip-sync | Lip Sync | Sync lips to audio |
| Subs | Subtitle Generation | Generate subtitles from audio |
| Dub | Video Dubbing | Translate and dub videos |
## Example Searches
### Find Text-to-Image Models
1. Go to Models → Find on HuggingFace
2. Click "T2I" chip
3. Results show only T2I models (Stable Diffusion, FLUX, etc.)
### Find Vision LLMs (Multimodal)
1. Click both "Text" and "I2T" chips
2. Results show models that can do both text generation and image understanding (LLaVA, Qwen-VL, etc.)
### Find Text-to-Video Models
1. Click "T2V" chip
2. Results show T2V models (CogVideoX, LTX-Video, etc.)
### Find Models with Multiple Capabilities
1. Click multiple capability chips
2. Only models with ALL selected capabilities are shown
3. Great for finding truly multimodal models
# CoderAI # CoderAI
An OpenAI-compatible API server supporting multiple GPU backends: NVIDIA (CUDA), AMD (Vulkan), and Intel (Vulkan). Uses HuggingFace Transformers for NVIDIA GPUs and llama-cpp-python with Vulkan for AMD/Intel GPUs. An OpenAI-compatible API server with web administration dashboard, supporting multiple GPU backends: NVIDIA (CUDA), AMD (Vulkan), and Intel (Vulkan). Configuration-driven architecture with per-model settings and multi-modal support (text, image, audio, TTS).
## Features ## Features
- **Multi-Backend Support**: ### Core Capabilities
- NVIDIA (CUDA) via PyTorch + Transformers
- AMD GPUs via llama-cpp-python + Vulkan
- Intel GPUs (iGPU/Arc) via llama-cpp-python + Vulkan
- **OpenAI-Compatible API**: Drop-in replacement for OpenAI's API endpoints - **OpenAI-Compatible API**: Drop-in replacement for OpenAI's API endpoints
- **Memory-Aware Model Loading**: Automatically determines optimal loading strategy based on available VRAM and RAM (NVIDIA) - **Web Admin Dashboard**: Modern UI for model management, user authentication, and API tokens
- **Sequential Offloading**: Smart offload from VRAM → RAM → Disk when needed (NVIDIA) - **Configuration-Based**: JSON config files for all settings - no complex CLI arguments
- **Multi-GPU Support**: Automatic distribution across multiple CUDA devices (NVIDIA) - **Multi-Modal Support**: Text generation, image generation, audio transcription, text-to-speech
- **GPU Auto-Detection**: Automatically detects available backends - **Per-Model Configuration**: Individual settings for each model (GPU layers, quantization, context size)
- **Quantization Support**: 4-bit and 8-bit quantization via bitsandbytes (NVIDIA) or built-in GGUF quantization (Vulkan) - **On-Demand Loading**: Models load automatically when requested, unload when idle
- **Flash Attention 2**: Optional faster attention implementation for supported NVIDIA GPUs
- **Streaming Responses**: Server-sent events for real-time token generation ### GPU Backend Support
- **Tool Calling**: Support for function calling and tool use - **NVIDIA (CUDA)**: PyTorch + Transformers for HuggingFace models
- **Multiple Endpoints**: `/v1/chat/completions`, `/v1/completions`, and `/v1/models` - **AMD GPUs**: llama-cpp-python + Vulkan for GGUF models
- **Intel GPUs**: iGPU/Arc support via Vulkan
- **Auto-Detection**: Automatically selects best available backend
- **Multi-GPU**: Automatic distribution across multiple devices
### Advanced Features
- **Memory Management**: Smart VRAM → RAM → Disk offloading (NVIDIA)
- **Quantization**: 4-bit/8-bit via bitsandbytes (NVIDIA) or GGUF quantization (Vulkan)
- **Flash Attention 2**: Optional faster inference for supported NVIDIA GPUs
- **Streaming**: Server-sent events for real-time token generation
- **Tool Calling**: Function calling and tool use support
- **Authentication**: Session-based auth with API token support
## Installation ## Installation
...@@ -44,19 +52,20 @@ The easiest way to install is using the provided build script: ...@@ -44,19 +52,20 @@ The easiest way to install is using the provided build script:
git clone git@git.nexlab.net:nexlab/coderai.git git clone git@git.nexlab.net:nexlab/coderai.git
cd coderai cd coderai
# For NVIDIA GPUs (default) # Install all backends (recommended)
./build.sh nvidia ./build.sh all
# For AMD or Intel GPUs with Vulkan support # Or install specific backend:
./build.sh vulkan ./build.sh nvidia # NVIDIA GPUs only
./build.sh vulkan # AMD/Intel GPUs only
``` ```
**Note**: The `vulkan` option works for both AMD and Intel GPUs. **Note**: The `all` option installs support for all backends, allowing you to switch between them via configuration. The `vulkan` option works for both AMD and Intel GPUs.
The build script will: The build script will:
- Create a virtual environment - Create a virtual environment
- Install the appropriate dependencies for your GPU - Install the appropriate dependencies for your GPU
- Set up the correct backend - Set up the correct backend(s)
### Manual Installation ### Manual Installation
...@@ -155,216 +164,74 @@ pip install flash-attn --no-build-isolation ...@@ -155,216 +164,74 @@ pip install flash-attn --no-build-isolation
## Usage ## Usage
### Basic Usage ### Quick Start
```bash ```bash
# Activate the virtual environment created by build.sh # Activate the virtual environment
source venv/bin/activate source venv_all/bin/activate # or venv/bin/activate
# Run with NVIDIA backend (HuggingFace models)
python coderai --model microsoft/DialoGPT-medium --backend nvidia
# Run with Vulkan backend (GGUF models)
python coderai --model ./phi-3-mini-4k-instruct-q4_k_m.gguf --backend vulkan
# The server will start on http://0.0.0.0:8000 by default # Start the server (uses default config at ~/.coderai/)
``` python coderai
### Command-Line Options
```
usage: coderai [-h] [--model MODEL] [--backend {auto,nvidia,vulkan}] [--host HOST]
[--port PORT] [--offload-dir OFFLOAD_DIR] [--load-in-4bit]
[--load-in-8bit] [--ram RAM] [--flash-attn] [--n-gpu-layers N]
[--n-ctx N]
OpenAI-compatible API server supporting NVIDIA (CUDA) and Vulkan backends # Or specify a custom config directory
python coderai --config /path/to/config
options: # Enable debug mode for troubleshooting
-h, --help show this help message and exit python coderai --debug
--model MODEL Model name or path. For NVIDIA: HuggingFace model.
For Vulkan: GGUF file path or HF repo
--backend {auto,nvidia,vulkan}
Backend to use: auto (detect), nvidia (CUDA), or
vulkan (AMD/Intel GPUs via Vulkan)
--host HOST Host to bind to (default: 0.0.0.0)
--port PORT Port to bind to (default: 8000)
--offload-dir OFFLOAD_DIR
Directory for disk offload (NVIDIA only, default: ./offload)
--load-in-4bit Load model in 4-bit precision (NVIDIA only, requires bitsandbytes)
--load-in-8bit Load model in 8-bit precision (NVIDIA only, requires bitsandbytes)
--ram RAM Manually specify available RAM in GB (NVIDIA only)
--flash-attn Use Flash Attention 2 (NVIDIA only, requires flash-attn)
--n-gpu-layers N Number of layers to offload to GPU (Vulkan only,
default: -1 = all layers)
--n-ctx N Context window size (Vulkan only, default: 2048)
--vulkan-device N Vulkan GPU device ID to use (Vulkan only, default: 0)
--vulkan-single-gpu Force Vulkan to use only the specified GPU (prevents layer distribution across multiple GPUs)
--vulkan-list-devices List available Vulkan GPU devices and exit
--reply-filters Enable filtering of model replies. Can be repeated. See "Reply Filters" section for details.
--hf-chat-template Use HuggingFace transformers apply_chat_template. Can be repeated. See "HuggingFace Chat Template" section for details.
``` ```
### Reply Filters The server will start on `http://0.0.0.0:8000` by default.
The `--reply-filters` option controls filtering of model responses. By default, no filtering is applied. Filters can be specified in multiple ways:
**Filter Types:** ### Access Points
- `malformed` - Filter out malformed SEARCH/REPLACE blocks
- `tool_calls` - Strip tool call format tags from output
- `all` - Enable all filters
**Syntax:** - **Admin Dashboard**: http://localhost:8000/admin
- **Chat Interface**: http://localhost:8000/chat
- **API Endpoints**: http://localhost:8000/v1/*
- **API Documentation**: http://localhost:8000/docs
```bash ### First Login
# No filtering (default)
coderai
# Comma-separated - apply to all models
coderai --reply-filters malformed,tool_calls
# Apply to all text models or all image models Default credentials (you'll be prompted to change the password):
coderai --reply-filters text:malformed - **Username**: `admin`
coderai --reply-filters image:tool_calls - **Password**: `admin`
# Apply to SPECIFIC model ### Configuration Files
coderai --reply-filters text:llama-3.1:malformed
coderai --reply-filters image:sd-xl:tool_calls
# Different filters for different models (multiple --reply-filters) CoderAI uses JSON configuration files stored in `~/.coderai/` (or custom directory via `--config`):
coderai --reply-filters text:llama-3.1:malformed --reply-filters text:phi-3:tool_calls --reply-filters image:sd-xl:all
# Apply all filters to specific model
coderai --reply-filters text:llama-3.1:all
``` ```
~/.coderai/
**Filter Syntax Reference:** ├── config.json # Server, backend, and global settings
├── models.json # Model registry and per-model configurations
| Syntax | Applies To | ├── auth.json # Users, API tokens, and sessions
|--------|------------| └── secret_key # Session signing key (auto-generated)
| `all` | All models, all filters |
| `malformed` | All models, malformed filter |
| `tool_calls` | All models, tool_calls filter |
| `text:malformed` | All text models, malformed filter |
| `image:tool_calls` | All image models, tool_calls filter |
| `text:model_name:malformed` | Specific text model, malformed filter |
| `image:model_name:tool_calls` | Specific image model, tool_calls filter |
### HuggingFace Chat Template
The `--hf-chat-template` option enables using HuggingFace's `apply_chat_template` from the transformers library for GGUF models instead of llama.cpp's built-in chat template handling. This provides more consistent chat template formatting that matches HuggingFace models.
**Requirements:**
- `transformers` library must be installed
- The model must be available on HuggingFace Hub or have a `tokenizer_config.json` in the same directory as the GGUF file
**Usage:**
```bash
# Auto-detect and use HuggingFace chat template for all models
coderai --hf-chat-template auto --model llama-3.1-8b-instruct-q4_k_m.gguf
# Auto-detect for all text models
coderai --hf-chat-template text --model llama-3.1-8b-instruct-q4_k_m.gguf
# Use SPECIFIC template for a specific model
coderai --hf-chat-template "llama-3.1:llama3" --model llama-3.1-8b-instruct-q4_k_m.gguf
# Different templates for different models
coderai --hf-chat-template "llama-3.1:llama3" --hf-chat-template "phi-3:chatml"
# Or with Vulkan backend
coderai --backend vulkan --hf-chat-template auto --model llama-3.1-8b-instruct-q4_k_m.gguf
``` ```
**Syntax:** These files are automatically created with sensible defaults on first run.
| Syntax | Applies To |
|--------|------------|
| `--hf-chat-template auto` | Auto-detect and use HF template for all models |
| `--hf-chat-template text` | All text models (auto-detect template) |
| `--hf-chat-template text:model_name` | Specific model (auto-detect template) |
| `--hf-chat-template "model_name:template"` | Specific model with specific template |
**Template Examples:**
- `llama3` - Meta's Llama 3 chat format
- `chatml` - ChatML format
- `qwen` - Qwen chat format
- `phi` - Microsoft Phi chat format
**How it works:**
1. When `--hf-chat-template` is specified, the server attempts to load a HuggingFace tokenizer
2. If a template is specified (e.g., `"llama-3.1:llama3"`), it uses that template directly
3. If no template specified, it auto-detects from the tokenizer (local or HuggingFace Hub)
4. The tokenizer's `apply_chat_template` method is used for formatting chat messages
### Backend Selection
The `--backend` option controls which backend to use:
- **`auto`** (default): Automatically detects available backends, preferring NVIDIA if available
- **`nvidia`**: Use PyTorch + Transformers with CUDA (for NVIDIA GPUs)
- **`vulkan`**: Use llama-cpp-python with Vulkan (for AMD and Intel GPUs)
### Model Formats by Backend
#### NVIDIA Backend
Uses HuggingFace Transformers format:
```bash
python coderai --model microsoft/DialoGPT-medium --backend nvidia
python coderai --model meta-llama/Llama-2-7b-chat-hf --backend nvidia
```
#### Vulkan Backend ### Command-Line Options
Uses GGUF format (can be local files or downloaded from HuggingFace):
```bash
# Local GGUF file
python coderai --model ./phi-3-mini-4k-instruct-q4_k_m.gguf --backend vulkan
# Download from HuggingFace (auto-selects GGUF file)
python coderai --model microsoft/Phi-3-mini-4k-instruct-gguf --backend vulkan
# Specific GGUF file from repo
python coderai --model TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf --backend vulkan
```
**Finding GGUF models:**
- Search on HuggingFace: https://huggingface.co/models?search=gguf
- Popular collections: TheBloke, unsloth, bartowski
- Recommended quantization: Q4_K_M for best speed/quality balance
### Examples
#### Run with 4-bit Quantization (Low VRAM)
```bash
python coderai --model meta-llama/Llama-2-7b-chat-hf --load-in-4bit
```
#### Run with Custom Offload Directory
```bash
python coderai --model bigscience/bloom-7b1 --offload-dir /path/to/fast/storage
```
#### Run on Specific Host/Port
```bash
python coderai --model microsoft/DialoGPT-medium --host 127.0.0.1 --port 8080
```
#### Specify Available RAM Manually
Useful for containerized environments where auto-detection may not work:
```bash
python coderai --model meta-llama/Llama-2-13b-chat-hf --ram 32
``` ```
usage: coderai [-h] [--config CONFIG] [--debug] [--dump]
[--list-cached-models] [--remove-all-models]
[--remove-model REMOVE_MODEL] [--download-model DOWNLOAD_MODEL]
[--download-file-pattern DOWNLOAD_FILE_PATTERN]
[--vulkan-list-devices]
#### Enable Flash Attention 2 OpenAI-compatible API server supporting NVIDIA (CUDA) and Vulkan backends
```bash options:
python coderai --model meta-llama/Llama-2-7b-chat-hf --flash-attn -h, --help show this help message and exit
--config CONFIG Configuration directory (default: ~/.coderai/)
--debug Enable debug mode - dumps full request/response to stdout
--dump Dump model output: raw output, parsed output, and debug info
--list-cached-models List all cached models in the model cache directory
--remove-all-models Remove all cached models from the model cache directory
--remove-model NAME Remove a specific cached model by name or hash
--download-model ID Download a model to cache (URL or HuggingFace model ID)
--download-file-pattern PATTERN
File pattern for HuggingFace downloads (e.g., .gguf, .safetensors)
--vulkan-list-devices List available Vulkan GPU devices and exit
``` ```
## API Documentation ## API Documentation
...@@ -460,7 +327,197 @@ curl -X POST http://localhost:8000/v1/chat/completions \ ...@@ -460,7 +327,197 @@ curl -X POST http://localhost:8000/v1/chat/completions \
}' }'
``` ```
## Configuration for Different Setups ## Configuration
### Configuration Files
All settings are managed through JSON files in the configuration directory (`~/.coderai/` by default):
#### config.json - Server and Backend Settings
```json
{
"server": {
"host": "0.0.0.0",
"port": 8000,
"https": false,
"https_key_path": null,
"https_cert_path": null
},
"backend": {
"type": "auto",
"image_backend": "auto",
"audio_backend": "auto",
"tts_backend": "auto"
},
"models": {
"default_load_mode": "ondemand",
"hf_cache_dir": null,
"gguf_cache_dir": null
},
"offload": {
"directory": "./offload",
"strategy": "auto",
"max_gpu_percent": null,
"no_ram": false,
"load_in_4bit": false,
"load_in_8bit": false,
"manual_ram_gb": null,
"flash_attention": false
},
"vulkan": {
"n_gpu_layers": -1,
"n_ctx": 2048,
"device_id": 0,
"single_gpu": false
},
"image": {
"steps": 4,
"width": 512,
"height": 512,
"cfg_scale": 1.0,
"precision": "f32",
"cpu_offload": false
},
"whisper": {
"server_path": null,
"server_port": 8744
}
}
```
#### models.json - Model Registry
```json
{
"text_models": [
{
"id": "microsoft/DialoGPT-medium",
"backend": "nvidia",
"context_size": 2048,
"n_gpu_layers": -1,
"load_in_4bit": false,
"load_in_8bit": false,
"flash_attention": false,
"enabled": true
},
{
"id": "phi-3-mini-4k-instruct-q4_k_m.gguf",
"backend": "vulkan",
"context_size": 4096,
"n_gpu_layers": -1,
"enabled": true
}
],
"image_models": [
{
"id": "stable-diffusion-xl-base-1.0",
"backend": "nvidia",
"steps": 4,
"width": 512,
"height": 512,
"cfg_scale": 1.0,
"enabled": true
}
],
"audio_models": [],
"vision_models": [],
"tts_models": [],
"loaded": [],
"preload": [],
"aliases": {
"default": "microsoft/DialoGPT-medium"
}
}
```
#### auth.json - Users and API Tokens
```json
{
"users": [
{
"id": "admin",
"username": "admin",
"password_hash": "$argon2id$...",
"role": "admin",
"created_at": "2026-05-05T00:00:00Z"
}
],
"tokens": [
{
"id": "tok_abc123",
"token": "sk-coderai-abc123...",
"name": "Production API",
"created_at": "2026-05-05T00:00:00Z",
"last_used": null
}
],
"sessions": {}
}
```
### Managing Configuration
#### Via Web Dashboard
The easiest way to manage configuration is through the web dashboard at `http://localhost:8000/admin`:
- **Models**: Add, remove, enable/disable models; configure per-model settings
- **Users**: Create users, change passwords, manage roles
- **Tokens**: Generate API tokens for programmatic access
- **Settings**: Adjust server, backend, and global settings
#### Via Configuration Files
You can also edit the JSON files directly. Changes take effect after restarting the server or using the reload endpoint:
```bash
curl -X POST http://localhost:8000/admin/api/system/reload
```
### Per-Model Configuration
Each model can have its own settings that override global defaults:
**Text Models (NVIDIA backend):**
- `backend`: "nvidia" or "vulkan"
- `context_size`: Context window size
- `n_gpu_layers`: Number of layers on GPU (-1 = all)
- `load_in_4bit`: Enable 4-bit quantization
- `load_in_8bit`: Enable 8-bit quantization
- `flash_attention`: Enable Flash Attention 2
**Text Models (Vulkan backend):**
- `backend`: "vulkan"
- `context_size`: Context window size
- `n_gpu_layers`: Number of layers on GPU (-1 = all)
**Image Models:**
- `backend`: "nvidia" or "vulkan"
- `steps`: Number of diffusion steps
- `width`: Image width
- `height`: Image height
- `cfg_scale`: Classifier-free guidance scale
- `precision`: "f32" or "f16"
### Backend Selection
Backends can be configured globally in `config.json` or per-model in `models.json`:
- **`auto`**: Automatically detect and use best available backend
- **`nvidia`**: Use CUDA backend (PyTorch + Transformers)
- **`vulkan`**: Use Vulkan backend (llama-cpp-python)
### Model Loading Modes
Configure in `config.json` under `models.default_load_mode`:
- **`ondemand`** (default): Load models when first requested, unload when idle
- **`preload`**: Load models listed in `models.json``preload` array at startup
- **`lazy`**: Never preload, always load on-demand
## Backend-Specific Setup
### NVIDIA (CUDA) ### NVIDIA (CUDA)
...@@ -471,12 +528,24 @@ curl -X POST http://localhost:8000/v1/chat/completions \ ...@@ -471,12 +528,24 @@ curl -X POST http://localhost:8000/v1/chat/completions \
# Or manually install CUDA-enabled PyTorch # Or manually install CUDA-enabled PyTorch
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0"
pip install -r requirements-nvidia.txt pip install -r requirements-nvidia.txt
```
# Run with GPU acceleration **Configuration in models.json:**
python coderai --model meta-llama/Llama-2-7b-chat-hf --backend nvidia ```json
{
# Optional: Enable Flash Attention 2 for faster inference "text_models": [
python coderai --model meta-llama/Llama-2-7b-chat-hf --backend nvidia --flash-attn {
"id": "meta-llama/Llama-2-7b-chat-hf",
"backend": "nvidia",
"context_size": 4096,
"n_gpu_layers": -1,
"load_in_4bit": false,
"load_in_8bit": false,
"flash_attention": false,
"enabled": true
}
]
}
``` ```
### AMD and Intel (Vulkan) ### AMD and Intel (Vulkan)
...@@ -492,21 +561,6 @@ sudo dnf install vulkan-loader-devel vulkan-tools mesa-vulkan-drivers intel-gpu- ...@@ -492,21 +561,6 @@ sudo dnf install vulkan-loader-devel vulkan-tools mesa-vulkan-drivers intel-gpu-
# Using build script # Using build script
./build.sh vulkan ./build.sh vulkan
# Run with GGUF model
python coderai --model ./phi-3-mini-4k-instruct-q4_k_m.gguf --backend vulkan
# Or download automatically from HuggingFace
python coderai --model TheBloke/Llama-2-7B-GGUF --backend vulkan
# Control GPU layer offloading (default: -1 = all layers)
python coderai --model model.gguf --backend vulkan --n-gpu-layers 35
# Adjust context window (default: 2048)
python coderai --model model.gguf --backend vulkan --n-ctx 4096
# Select specific GPU device (if you have multiple GPUs - e.g., NVIDIA + AMD + Intel)
python coderai --model model.gguf --backend vulkan --vulkan-device 1
# List available Vulkan GPU devices # List available Vulkan GPU devices
python coderai --vulkan-list-devices python coderai --vulkan-list-devices
``` ```
...@@ -527,6 +581,33 @@ python coderai --vulkan-list-devices ...@@ -527,6 +581,33 @@ python coderai --vulkan-list-devices
- Recommended for Intel iGPUs: `Q4_K_M` quantized models under 2GB file size - Recommended for Intel iGPUs: `Q4_K_M` quantized models under 2GB file size
- Intel Arc GPUs work well with the same settings as AMD GPUs - Intel Arc GPUs work well with the same settings as AMD GPUs
**Configuration in models.json:**
```json
{
"text_models": [
{
"id": "phi-3-mini-4k-instruct-q4_k_m.gguf",
"backend": "vulkan",
"context_size": 4096,
"n_gpu_layers": -1,
"enabled": true
}
]
}
```
**Vulkan Configuration in config.json:**
```json
{
"vulkan": {
"n_gpu_layers": -1,
"n_ctx": 2048,
"device_id": 0,
"single_gpu": false
}
}
```
### CPU-Only ### CPU-Only
While not recommended for performance, you can run on CPU: While not recommended for performance, you can run on CPU:
...@@ -535,11 +616,21 @@ While not recommended for performance, you can run on CPU: ...@@ -535,11 +616,21 @@ While not recommended for performance, you can run on CPU:
# NVIDIA backend on CPU # NVIDIA backend on CPU
pip install "torch>=2.0.0" --index-url https://download.pytorch.org/whl/cpu pip install "torch>=2.0.0" --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements-nvidia.txt pip install -r requirements-nvidia.txt
python coderai --model microsoft/DialoGPT-medium --backend nvidia
# Or Vulkan backend on CPU (llama-cpp supports CPU fallback) # Or Vulkan backend on CPU (llama-cpp supports CPU fallback)
CMAKE_ARGS="-DGGML_VULKAN=OFF" pip install llama-cpp-python CMAKE_ARGS="-DGGML_VULKAN=OFF" pip install llama-cpp-python
python coderai --model model.gguf --backend vulkan ```
Configure in `config.json`:
```json
{
"backend": {
"type": "nvidia"
},
"vulkan": {
"n_gpu_layers": 0
}
}
``` ```
### ROCm Alternative (deprecated) ### ROCm Alternative (deprecated)
...@@ -548,54 +639,65 @@ While the Vulkan backend is now recommended for AMD GPUs, ROCm support is still ...@@ -548,54 +639,65 @@ While the Vulkan backend is now recommended for AMD GPUs, ROCm support is still
### Low VRAM Configuration ### Low VRAM Configuration
For GPUs with limited VRAM (4-8GB): For GPUs with limited VRAM (4-8GB), configure in `config.json` or per-model in `models.json`:
```bash **Global configuration (config.json):**
# Option 1: Use 4-bit quantization ```json
python coderai --model meta-llama/Llama-2-7b-chat-hf --load-in-4bit {
"offload": {
# Option 2: Use 8-bit quantization "load_in_4bit": true,
python coderai --model meta-llama/Llama-2-13b-chat-hf --load-in-8bit "directory": "/path/to/fast/storage"
}
}
```
# Option 3: Enable disk offload for very large models **Per-model configuration (models.json):**
python coderai --model bigscience/bloom-7b1 --offload-dir /path/to/fast/storage ```json
{
"text_models": [
{
"id": "meta-llama/Llama-2-7b-chat-hf",
"backend": "nvidia",
"load_in_4bit": true,
"enabled": true
}
]
}
``` ```
### Using Vulkan with Multiple GPUs (NVIDIA + AMD) ### Using Vulkan with Multiple GPUs (NVIDIA + AMD)
If your system has both NVIDIA and AMD GPUs, llama.cpp's Vulkan backend will automatically distribute layers across all visible GPUs for performance. To force Vulkan to use **only** the AMD GPU and prevent VRAM allocation on the NVIDIA GPU: If your system has both NVIDIA and AMD GPUs, llama.cpp's Vulkan backend will automatically distribute layers across all visible GPUs for performance. To force Vulkan to use **only** the AMD GPU and prevent VRAM allocation on the NVIDIA GPU, configure in `config.json`:
**Method 1: Use `--vulkan-single-gpu` flag (Recommended)** **Configuration in config.json:**
```bash ```json
# Force all layers onto the specified GPU device only {
# For example, to use only device 1 (AMD GPU): "vulkan": {
python coderai --model model.gguf --backend vulkan --vulkan-device 1 --vulkan-single-gpu --port 6744 "device_id": 1,
"single_gpu": true
# This creates a tensor_split that puts 0% on other GPUs and 100% on the selected GPU }
}
``` ```
**Method 2: Use environment variable to select specific Vulkan device** **Alternative: Environment variables**
```bash ```bash
# List available Vulkan devices first # List available Vulkan devices first
python coderai --vulkan-list-devices python coderai --vulkan-list-devices
# Then use VK_DEVICE_SELECT_DEVICE to force a specific device # Then use VK_DEVICE_SELECT_DEVICE to force a specific device
# For example, if device 1 is your AMD GPU: # For example, if device 1 is your AMD GPU:
VK_DEVICE_SELECT_DEVICE=1 python coderai --model model.gguf --backend vulkan --vulkan-device 0 --port 6744 VK_DEVICE_SELECT_DEVICE=1 python coderai
```
**Method 3: Hide NVIDIA GPU from CUDA (prevents any CUDA usage)** # Or hide NVIDIA GPU from CUDA (prevents any CUDA usage)
```bash CUDA_VISIBLE_DEVICES="" python coderai
# Make NVIDIA GPU invisible to CUDA/Vulkan
CUDA_VISIBLE_DEVICES="" python coderai --model model.gguf --backend vulkan --vulkan-device 0 --port 6744
``` ```
**Understanding the Issue:** **Understanding the Issue:**
When you have multiple Vulkan-compatible GPUs, llama.cpp automatically distributes model layers across them (shown in logs as "layer X assigned to device VulkanY"). The `--vulkan-single-gpu` flag prevents this by using the `tensor_split` parameter with a value of `[0.0, 1.0]` (or similar depending on device count), which tells llama.cpp to put 0% of layers on some GPUs and 100% on the selected GPU. When you have multiple Vulkan-compatible GPUs, llama.cpp automatically distributes model layers across them (shown in logs as "layer X assigned to device VulkanY"). The `single_gpu: true` setting prevents this by using the `tensor_split` parameter with a value of `[0.0, 1.0]` (or similar depending on device count), which tells llama.cpp to put 0% of layers on some GPUs and 100% on the selected GPU.
**Notes:** **Notes:**
- The `--vulkan-device` argument maps to `main_gpu` in llama-cpp-python - The `device_id` setting maps to `main_gpu` in llama-cpp-python
- The `--vulkan-single-gpu` flag builds a `tensor_split` array to force single GPU usage - The `single_gpu` flag builds a `tensor_split` array to force single GPU usage
- Vulkan enumerates all GPUs in your system, so device IDs may differ from CUDA device IDs - Vulkan enumerates all GPUs in your system, so device IDs may differ from CUDA device IDs
- The `vulkaninfo` command shows all GPUs visible to Vulkan - The `vulkaninfo` command shows all GPUs visible to Vulkan
...@@ -608,7 +710,7 @@ Multiple GPUs are automatically detected and utilized. The model will be distrib ...@@ -608,7 +710,7 @@ Multiple GPUs are automatically detected and utilized. The model will be distrib
export CUDA_VISIBLE_DEVICES=0,1,2,3 export CUDA_VISIBLE_DEVICES=0,1,2,3
# Run - model will be distributed across all visible GPUs # Run - model will be distributed across all visible GPUs
python coderai --model meta-llama/Llama-2-70b-chat-hf --load-in-8bit python coderai
``` ```
## Model Recommendations ## Model Recommendations
......
#!/bin/bash #!/bin/bash
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
# Build script for CoderAI - Supports NVIDIA (CUDA), Vulkan, OpenCL, and CPU backends # Build script for CoderAI - Supports NVIDIA (CUDA), Vulkan, OpenCL, and CPU backends
# Usage: ./build.sh [nvidia|vulkan|vulkan-nvidia|cuda|opencl|all] [--flash] [--venv <venv>] # Usage: ./build.sh [nvidia|vulkan|vulkan-nvidia|cuda|opencl|all] [--flash] [--venv <venv>]
# Default: all (installs all backends) # Default: all (installs all backends)
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
# codai module - AI model parsing utilities # codai module - AI model parsing utilities
from .models.parser import ( from .models.parser import (
ModelParserDispatcher, ModelParserDispatcher,
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Admin dashboard package for coderai.""" """Admin dashboard package for coderai."""
from .routes import router from .routes import router
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Authentication and session management for admin dashboard.""" """Authentication and session management for admin dashboard."""
import hashlib import hashlib
import hmac import hmac
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Admin dashboard routes.""" """Admin dashboard routes."""
from pathlib import Path from pathlib import Path
from typing import Optional from typing import Optional
...@@ -261,6 +277,14 @@ async def api_status(username: str = Depends(require_auth)): ...@@ -261,6 +277,14 @@ async def api_status(username: str = Depends(require_auth)):
except Exception: except Exception:
pass pass
# Recent activity
recent_activity = []
try:
from codai.api.log import get_recent_activity
recent_activity = get_recent_activity()
except Exception:
pass
return { return {
"status": "ok", "status": "ok",
"backend": backend, "backend": backend,
...@@ -270,6 +294,7 @@ async def api_status(username: str = Depends(require_auth)): ...@@ -270,6 +294,7 @@ async def api_status(username: str = Depends(require_auth)):
"enabled_models": enabled_models, "enabled_models": enabled_models,
"vram": vram, "vram": vram,
"requests": {"total": req_total, "active": req_active}, "requests": {"total": req_total, "active": req_active},
"recent_activity": recent_activity,
} }
...@@ -706,6 +731,7 @@ def _scan_caches() -> dict: ...@@ -706,6 +731,7 @@ def _scan_caches() -> dict:
result: dict = {"hf": [], "gguf": []} result: dict = {"hf": [], "gguf": []}
from codai.models.cache import get_all_cache_dirs, get_model_cache_dir from codai.models.cache import get_all_cache_dirs, get_model_cache_dir
from codai.models.capabilities import detect_model_capabilities
caches = get_all_cache_dirs() caches = get_all_cache_dirs()
# Collect configured models: key (path/id) → (settings_dict, model_type) # Collect configured models: key (path/id) → (settings_dict, model_type)
...@@ -748,6 +774,7 @@ def _scan_caches() -> dict: ...@@ -748,6 +774,7 @@ def _scan_caches() -> dict:
cfg = (configured_settings.get(fpath) cfg = (configured_settings.get(fpath)
or configured_settings.get(fname) or configured_settings.get(fname)
or ({}, None)) or ({}, None))
caps = detect_model_capabilities(fname)
result["gguf"].append({ result["gguf"].append({
"filename": fname, "filename": fname,
"path": fpath, "path": fpath,
...@@ -756,10 +783,12 @@ def _scan_caches() -> dict: ...@@ -756,10 +783,12 @@ def _scan_caches() -> dict:
"in_config": fpath in configured_settings or fname in configured_settings, "in_config": fpath in configured_settings or fname in configured_settings,
"model_type": cfg[1] if cfg[1] and cfg[1] != "gguf_models" else "text_models", "model_type": cfg[1] if cfg[1] and cfg[1] != "gguf_models" else "text_models",
"settings": cfg[0] if isinstance(cfg[0], dict) else {}, "settings": cfg[0] if isinstance(cfg[0], dict) else {},
"capabilities": caps.to_list(),
}) })
continue # skip adding to hf list continue # skip adding to hf list
cfg = configured_settings.get(repo.repo_id, ({}, None)) cfg = configured_settings.get(repo.repo_id, ({}, None))
caps = detect_model_capabilities(repo.repo_id)
result["hf"].append({ result["hf"].append({
"id": repo.repo_id, "id": repo.repo_id,
"size_gb": round(size_bytes / 1e9, 2), "size_gb": round(size_bytes / 1e9, 2),
...@@ -770,6 +799,7 @@ def _scan_caches() -> dict: ...@@ -770,6 +799,7 @@ def _scan_caches() -> dict:
"in_config": repo.repo_id in configured_settings, "in_config": repo.repo_id in configured_settings,
"model_type": cfg[1] if cfg[1] and cfg[1] != "gguf_models" else "text_models", "model_type": cfg[1] if cfg[1] and cfg[1] != "gguf_models" else "text_models",
"settings": cfg[0] if isinstance(cfg[0], dict) else {}, "settings": cfg[0] if isinstance(cfg[0], dict) else {},
"capabilities": caps.to_list(),
}) })
except Exception as e: except Exception as e:
result["hf_error"] = str(e) result["hf_error"] = str(e)
...@@ -784,6 +814,7 @@ def _scan_caches() -> dict: ...@@ -784,6 +814,7 @@ def _scan_caches() -> dict:
cfg = (configured_settings.get(fpath) cfg = (configured_settings.get(fpath)
or configured_settings.get(fname) or configured_settings.get(fname)
or ({}, None)) or ({}, None))
caps = detect_model_capabilities(fname)
result["gguf"].append({ result["gguf"].append({
"filename": fname, "filename": fname,
"path": fpath, "path": fpath,
...@@ -792,6 +823,7 @@ def _scan_caches() -> dict: ...@@ -792,6 +823,7 @@ def _scan_caches() -> dict:
"in_config": fpath in configured_settings or fname in configured_settings, "in_config": fpath in configured_settings or fname in configured_settings,
"model_type": cfg[1] if cfg[1] and cfg[1] != "gguf_models" else "text_models", "model_type": cfg[1] if cfg[1] and cfg[1] != "gguf_models" else "text_models",
"settings": cfg[0] if isinstance(cfg[0], dict) else {}, "settings": cfg[0] if isinstance(cfg[0], dict) else {},
"capabilities": caps.to_list(),
}) })
# Add configured GGUF models not yet in the list (e.g., HF repo IDs or external paths) # Add configured GGUF models not yet in the list (e.g., HF repo IDs or external paths)
...@@ -806,6 +838,7 @@ def _scan_caches() -> dict: ...@@ -806,6 +838,7 @@ def _scan_caches() -> dict:
size_bytes = 0 size_bytes = 0
if os.path.isfile(path): if os.path.isfile(path):
size_bytes = os.path.getsize(path) size_bytes = os.path.getsize(path)
caps = detect_model_capabilities(path)
result["gguf"].append({ result["gguf"].append({
"filename": os.path.basename(path) if '/' in path else path, "filename": os.path.basename(path) if '/' in path else path,
"path": path, "path": path,
...@@ -814,6 +847,7 @@ def _scan_caches() -> dict: ...@@ -814,6 +847,7 @@ def _scan_caches() -> dict:
"in_config": True, "in_config": True,
"model_type": mtype if mtype and mtype != "gguf_models" else "text_models", "model_type": mtype if mtype and mtype != "gguf_models" else "text_models",
"settings": settings if isinstance(settings, dict) else {}, "settings": settings if isinstance(settings, dict) else {},
"capabilities": caps.to_list(),
}) })
return result return result
...@@ -1384,6 +1418,7 @@ async def api_hf_search( ...@@ -1384,6 +1418,7 @@ async def api_hf_search(
sort: str = "downloads", sort: str = "downloads",
sizes: str = "", # comma-separated e.g. "7b,70b" sizes: str = "", # comma-separated e.g. "7b,70b"
arch: str = "", arch: str = "",
capabilities: str = "", # comma-separated e.g. "function-calling,vision"
username: str = Depends(require_admin), username: str = Depends(require_admin),
): ):
"""Proxy HuggingFace model search; supports multiple sizes via parallel requests.""" """Proxy HuggingFace model search; supports multiple sizes via parallel requests."""
...@@ -1391,6 +1426,7 @@ async def api_hf_search( ...@@ -1391,6 +1426,7 @@ async def api_hf_search(
import urllib.request import urllib.request
import urllib.parse import urllib.parse
import json as _json import json as _json
from codai.models.capabilities import detect_model_capabilities
if sort not in ("downloads", "likes", "lastModified", "createdAt"): if sort not in ("downloads", "likes", "lastModified", "createdAt"):
sort = "downloads" sort = "downloads"
...@@ -1404,6 +1440,11 @@ async def api_hf_search( ...@@ -1404,6 +1440,11 @@ async def api_hf_search(
if arch == "lora": if arch == "lora":
filter_pairs.append(("filter", "lora")) filter_pairs.append(("filter", "lora"))
# Capability filters
cap_list = [c.strip() for c in capabilities.split(",") if c.strip()]
for cap in cap_list:
filter_pairs.append(("filter", cap))
# Base search keywords # Base search keywords
base_parts = [q.strip()] if q.strip() else [] base_parts = [q.strip()] if q.strip() else []
if arch == "moe": if arch == "moe":
...@@ -1452,12 +1493,24 @@ async def api_hf_search( ...@@ -1452,12 +1493,24 @@ async def api_hf_search(
if gguf_mode == "no-gguf": if gguf_mode == "no-gguf":
merged = [m for m in merged if "gguf" not in (m.get("modelId") or m.get("id", "")).lower()] merged = [m for m in merged if "gguf" not in (m.get("modelId") or m.get("id", "")).lower()]
# Get VRAM info
vram_gb = None
try:
import torch
if torch.cuda.is_available():
free, total = torch.cuda.mem_get_info()
vram_gb = round(free / 1e9, 2)
except Exception:
pass
return [ return [
{ {
"id": m.get("modelId") or m.get("id", ""), "id": m.get("modelId") or m.get("id", ""),
"downloads": m.get("downloads", 0), "downloads": m.get("downloads", 0),
"likes": m.get("likes", 0), "likes": m.get("likes", 0),
"pipeline_tag": m.get("pipeline_tag", ""), "pipeline_tag": m.get("pipeline_tag", ""),
"vram_available": vram_gb,
"capabilities": detect_model_capabilities(m.get("modelId") or m.get("id", "")).to_list(),
} }
for m in merged[:20] for m in merged[:20]
] ]
......
...@@ -729,10 +729,23 @@ function renderSidebar() { ...@@ -729,10 +729,23 @@ function renderSidebar() {
if (!models.length) { el.innerHTML='<div class="muted small" style="padding:.5rem .6rem">No models</div>'; return; } if (!models.length) { el.innerHTML='<div class="muted small" style="padding:.5rem .6rem">No models</div>'; return; }
el.innerHTML = models.map(m => { el.innerHTML = models.map(m => {
const t = m.type || 'text'; const t = m.type || 'text';
const caps = m.capabilities || [];
const safe = JSON.stringify(m).replace(/"/g,'&quot;'); const safe = JSON.stringify(m).replace(/"/g,'&quot;');
// Show multimodal badge if model has multiple capabilities
const capLabels = {
text_generation:'T',image_generation:'I',image_to_text:'V',
video_generation:'Vid',audio_generation:'A',speech_to_text:'STT',
text_to_speech:'TTS',embeddings:'E'
};
const mainCaps = caps.filter(c=>capLabels[c]).slice(0,3);
const capBadges = mainCaps.length > 1
? `<span style="font-size:9px;color:var(--text-3);margin-left:.25rem">${mainCaps.map(c=>capLabels[c]).join('+')}</span>`
: '';
return `<div class="model-item" data-id="${m.id}" onclick="selectModel(${safe})"> return `<div class="model-item" data-id="${m.id}" onclick="selectModel(${safe})">
<span class="mbadge ${BADGE[t]||'mb-text'}">${BLABEL[t]||t}</span> <span class="mbadge ${BADGE[t]||'mb-text'}">${BLABEL[t]||t}</span>
<span style="overflow:hidden;text-overflow:ellipsis;white-space:nowrap;font-size:12px" title="${m.id}">${m.id.split('/').pop()}</span> <span style="overflow:hidden;text-overflow:ellipsis;white-space:nowrap;font-size:12px" title="${m.id}">${m.id.split('/').pop()}${capBadges}</span>
</div>`; </div>`;
}).join(''); }).join('');
} }
......
...@@ -98,6 +98,25 @@ async function poll() { ...@@ -98,6 +98,25 @@ async function poll() {
document.getElementById('req-total').textContent = d.requests.total ?? 0; document.getElementById('req-total').textContent = d.requests.total ?? 0;
document.getElementById('req-active').textContent = d.requests.active ?? 0; document.getElementById('req-active').textContent = d.requests.active ?? 0;
} }
const rows = d.recent_activity || [];
const tbody = document.getElementById('activity-body');
if (rows.length === 0) {
tbody.innerHTML = '<tr class="empty-row"><td colspan="5">No recent activity</td></tr>';
} else {
tbody.innerHTML = rows.map(r => {
const t = new Date(r.time * 1000).toLocaleTimeString();
const ok = r.status >= 200 && r.status < 300;
const badge = ok ? 'badge-admin' : 'badge-danger';
return `<tr>
<td>${t}</td>
<td class="small">${r.model}</td>
<td>${r.type}</td>
<td><span class="badge ${badge}">${r.status}</span></td>
<td>${r.duration}s</td>
</tr>`;
}).join('');
}
} catch { } catch {
document.getElementById('sys-status').textContent = 'Offline'; document.getElementById('sys-status').textContent = 'Offline';
document.getElementById('sys-status').className = 'stat-value small text-red'; document.getElementById('sys-status').className = 'stat-value small text-red';
......
...@@ -179,7 +179,30 @@ ...@@ -179,7 +179,30 @@
</div> </div>
</div> </div>
<!-- filter row 3: quant chips (file-level filter) --> <!-- filter row 3: capability chips -->
<div style="display:flex;align-items:flex-start;gap:.5rem;margin-bottom:.625rem">
<span class="fl" style="padding-top:.25rem;min-width:32px">Cap.</span>
<div class="chip-row" id="cap-chips">
<span class="chip" data-val="text_generation">Text</span>
<span class="chip" data-val="image_generation">T2I</span>
<span class="chip" data-val="image_to_text">I2T</span>
<span class="chip" data-val="video_generation">T2V</span>
<span class="chip" data-val="image_to_video">I2V</span>
<span class="chip" data-val="audio_generation">T2A</span>
<span class="chip" data-val="speech_to_text">STT</span>
<span class="chip" data-val="text_to_speech">TTS</span>
<span class="chip" data-val="embeddings">Embed</span>
<span class="chip" data-val="function-calling">Tool calling</span>
<span class="chip" data-val="vision">Vision</span>
<span class="chip" data-val="reasoning">Reasoning</span>
<span class="chip" data-val="code">Code</span>
<span class="chip" data-val="multilingual">Multilingual</span>
<span class="chip" data-val="roleplay">Roleplay</span>
<span class="chip" data-val="math">Math</span>
</div>
</div>
<!-- filter row 4: quant chips (file-level filter) -->
<div style="display:flex;align-items:flex-start;gap:.5rem;margin-bottom:1rem"> <div style="display:flex;align-items:flex-start;gap:.5rem;margin-bottom:1rem">
<span class="fl" style="padding-top:.25rem;min-width:32px">Quant</span> <span class="fl" style="padding-top:.25rem;min-width:32px">Quant</span>
<div class="chip-row" id="quant-chips"> <div class="chip-row" id="quant-chips">
...@@ -440,6 +463,21 @@ function fmtNum(n){if(!n)return'0';return n>=1e6?(n/1e6).toFixed(1)+'M':n>=1000? ...@@ -440,6 +463,21 @@ function fmtNum(n){if(!n)return'0';return n>=1e6?(n/1e6).toFixed(1)+'M':n>=1000?
function fmtGB(gb){if(!gb)return'—';return gb>=1?gb.toFixed(1)+' GB':(gb*1024).toFixed(0)+' MB'} function fmtGB(gb){if(!gb)return'—';return gb>=1?gb.toFixed(1)+' GB':(gb*1024).toFixed(0)+' MB'}
function fmtDate(s){try{return new Date(s).toLocaleDateString(undefined,{year:'numeric',month:'short',day:'numeric'})}catch{return s}} function fmtDate(s){try{return new Date(s).toLocaleDateString(undefined,{year:'numeric',month:'short',day:'numeric'})}catch{return s}}
function fmtCapabilities(caps){
if(!caps||!caps.length)return'';
const labels={
text_generation:'Text',image_generation:'T2I',image_to_text:'I2T',
video_generation:'T2V',image_to_video:'I2V',audio_generation:'T2A',
speech_to_text:'STT',text_to_speech:'TTS',embeddings:'Embed',
image_to_image:'I2I',video_to_video:'V2V',audio_to_audio:'A2A',
inpainting:'Inpaint',controlnet:'ControlNet',depth_estimation:'Depth',
image_segmentation:'Segment',image_upscaling:'Upscale',face_restoration:'Face',
object_detection:'Detect',video_interpolation:'Interp',video_upscaling:'V-Upscale',
lip_sync:'Lip-sync',subtitle_generation:'Subs',video_dubbing:'Dub'
};
return caps.slice(0,5).map(c=>`<span class="badge badge-user" style="font-size:10px;padding:.15rem .35rem">${esc(labels[c]||c)}</span>`).join(' ');
}
/* ── tab / modal ─────────────────────────────────────── */ /* ── tab / modal ─────────────────────────────────────── */
function switchTab(name,btn){ function switchTab(name,btn){
document.querySelectorAll('.tab-panel').forEach(p=>p.classList.remove('active')); document.querySelectorAll('.tab-panel').forEach(p=>p.classList.remove('active'));
...@@ -450,6 +488,19 @@ function switchTab(name,btn){ ...@@ -450,6 +488,19 @@ function switchTab(name,btn){
function openModal(id){document.getElementById(id).classList.add('show')} function openModal(id){document.getElementById(id).classList.add('show')}
function closeModal(id){document.getElementById(id).classList.remove('show')} function closeModal(id){document.getElementById(id).classList.remove('show')}
/* ── Global settings ─────────────────────────────────── */
let _defaultOffloadDir = './offload';
async function loadGlobalSettings(){
try{
const r = await fetch('/admin/api/settings');
if(r.ok){
const d = await r.json();
_defaultOffloadDir = d.offload?.directory || './offload';
}
}catch{}
}
/* ── GGUF format toggle ──────────────────────────────── */ /* ── GGUF format toggle ──────────────────────────────── */
let _ggufMode = 'gguf'; let _ggufMode = 'gguf';
document.querySelectorAll('.tog-btn').forEach(btn=>{ document.querySelectorAll('.tog-btn').forEach(btn=>{
...@@ -471,6 +522,18 @@ let _results = []; ...@@ -471,6 +522,18 @@ let _results = [];
let _filesCache = {}; let _filesCache = {};
let _activeQuants = new Set(); let _activeQuants = new Set();
function estimateModelSize(modelId){
const id = modelId.toLowerCase();
// Extract parameter count (e.g., 7b, 13b, 70b)
const match = id.match(/(\d+\.?\d*)b/);
if(!match) return 8; // default guess
const params = parseFloat(match[1]);
// Rough estimate: Q4 ≈ 0.5GB per B params, Q8 ≈ 1GB per B, FP16 ≈ 2GB per B
if(id.includes('q4') || id.includes('4bit')) return params * 0.5;
if(id.includes('q8') || id.includes('8bit')) return params * 1.0;
return params * 2; // assume FP16
}
document.getElementById('search-q').addEventListener('keydown',e=>{if(e.key==='Enter')doSearch()}); document.getElementById('search-q').addEventListener('keydown',e=>{if(e.key==='Enter')doSearch()});
async function doSearch(){ async function doSearch(){
...@@ -482,6 +545,10 @@ async function doSearch(){ ...@@ -482,6 +545,10 @@ async function doSearch(){
const sizes = getChips('size-chips').join(','); const sizes = getChips('size-chips').join(',');
_activeQuants = new Set(getChips('quant-chips').map(v=>v.toUpperCase().split(' ')[0])); // strip ★ _activeQuants = new Set(getChips('quant-chips').map(v=>v.toUpperCase().split(' ')[0])); // strip ★
// Get selected capability filters (from our custom chips)
const selectedCaps = getChips('cap-chips');
const capFilters = selectedCaps.filter(c=>!['function-calling','vision','reasoning','code','multilingual','roleplay','math'].includes(c));
_filesCache = {}; _filesCache = {};
_results = []; _results = [];
out.innerHTML = '<span class="muted small">Searching HuggingFace…</span>'; out.innerHTML = '<span class="muted small">Searching HuggingFace…</span>';
...@@ -490,20 +557,43 @@ async function doSearch(){ ...@@ -490,20 +557,43 @@ async function doSearch(){
if(pipeline) params.append('pipeline_tag', pipeline); if(pipeline) params.append('pipeline_tag', pipeline);
if(sizes) params.append('sizes', sizes); if(sizes) params.append('sizes', sizes);
if(arch) params.append('arch', arch); if(arch) params.append('arch', arch);
const caps = getChips('cap-chips');
if(caps.length) params.append('capabilities', caps.join(','));
try{ try{
const r = await fetch('/admin/api/hf-search?'+params); const r = await fetch('/admin/api/hf-search?'+params);
if(!r.ok){const e=await r.json();throw new Error(e.detail||r.statusText)} if(!r.ok){const e=await r.json();throw new Error(e.detail||r.statusText)}
_results = await r.json(); _results = await r.json();
// Client-side filter by detected capabilities if any custom caps selected
if(capFilters.length > 0){
_results = _results.filter(m=>{
const modelCaps = m.capabilities || [];
return capFilters.some(cf=>modelCaps.includes(cf));
});
}
if(!_results.length){out.innerHTML='<span class="muted small">No results. Try different keywords or fewer filters.</span>';return} if(!_results.length){out.innerHTML='<span class="muted small">No results. Try different keywords or fewer filters.</span>';return}
out.innerHTML = _results.map((m,i)=>`
const vramAvail = _results[0]?.vram_available;
out.innerHTML = _results.map((m,i)=>{
let vramDot = '';
if(vramAvail){
const estSize = estimateModelSize(m.id);
const color = estSize <= vramAvail*0.8 ? '#10b981' : estSize <= vramAvail*0.95 ? '#f59e0b' : '#ef4444';
vramDot = `<span style="display:inline-block;width:8px;height:8px;border-radius:50%;background:${color};margin-right:.35rem" title="Est. ${estSize}GB / ${vramAvail}GB available"></span>`;
}
const capBadges = fmtCapabilities(m.capabilities||[]);
return `
<div style="padding:.75rem 0;border-bottom:1px solid var(--border)"> <div style="padding:.75rem 0;border-bottom:1px solid var(--border)">
<div style="display:flex;align-items:flex-start;justify-content:space-between;gap:.5rem"> <div style="display:flex;align-items:flex-start;justify-content:space-between;gap:.5rem">
<div style="min-width:0;flex:1"> <div style="min-width:0;flex:1">
<div style="font-weight:500;font-size:13px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap" <div style="font-weight:500;font-size:13px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap;display:flex;align-items:center"
title="${esc(m.id)}">${esc(m.id)}</div> title="${esc(m.id)}">${vramDot}${esc(m.id)}</div>
<div style="font-size:11px;color:var(--text-3);margin-top:.25rem;display:flex;align-items:center;gap:.5rem;flex-wrap:wrap"> <div style="font-size:11px;color:var(--text-3);margin-top:.25rem;display:flex;align-items:center;gap:.5rem;flex-wrap:wrap">
${m.pipeline_tag?`<span class="badge badge-user">${esc(m.pipeline_tag)}</span>`:''} ${m.pipeline_tag?`<span class="badge badge-user">${esc(m.pipeline_tag)}</span>`:''}
${capBadges}
<span>↓ ${fmtNum(m.downloads)}</span> <span>↓ ${fmtNum(m.downloads)}</span>
<span>♥ ${fmtNum(m.likes)}</span> <span>♥ ${fmtNum(m.likes)}</span>
</div> </div>
...@@ -517,7 +607,8 @@ async function doSearch(){ ...@@ -517,7 +607,8 @@ async function doSearch(){
<div id="fp-${i}" style="display:none;margin-top:.625rem;padding:.5rem .625rem;background:var(--raised);border-radius:6px"> <div id="fp-${i}" style="display:none;margin-top:.625rem;padding:.5rem .625rem;background:var(--raised);border-radius:6px">
<span class="muted small">Loading…</span> <span class="muted small">Loading…</span>
</div> </div>
</div>`).join(''); </div>`;
}).join('');
}catch(e){ }catch(e){
out.innerHTML='<span class="muted small">Error: '+esc(e.message)+'</span>'; out.innerHTML='<span class="muted small">Error: '+esc(e.message)+'</span>';
} }
...@@ -869,10 +960,12 @@ async function loadCachedModels(){ ...@@ -869,10 +960,12 @@ async function loadCachedModels(){
_localModels.push({label:m.id, path:m.id, cacheType:'hf', size_gb:m.size_gb||0, _localModels.push({label:m.id, path:m.id, cacheType:'hf', size_gb:m.size_gb||0,
defaultType:m.model_type||'text_models', settings:m.settings||{}, in_config:m.in_config}); defaultType:m.model_type||'text_models', settings:m.settings||{}, in_config:m.in_config});
const loaded = _loadedKeys.has(m.id) || [..._loadedKeys].some(k=>k.endsWith(':'+m.id)||k===m.id); const loaded = _loadedKeys.has(m.id) || [..._loadedKeys].some(k=>k.endsWith(':'+m.id)||k===m.id);
const capBadges = fmtCapabilities(m.capabilities||[]);
return `<tr style="border-top:1px solid var(--border)"> return `<tr style="border-top:1px solid var(--border)">
<td style="padding:.4rem .25rem;font-family:monospace;font-size:12px;max-width:260px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap" title="${esc(m.id)}">${esc(m.id)}</td> <td style="padding:.4rem .25rem;font-family:monospace;font-size:12px;max-width:260px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap" title="${esc(m.id)}">${esc(m.id)}</td>
<td style="text-align:right;padding:.4rem .25rem;white-space:nowrap;color:var(--text-2)">${fmtGB(m.size_gb)}</td> <td style="text-align:right;padding:.4rem .25rem;white-space:nowrap;color:var(--text-2)">${fmtGB(m.size_gb)}</td>
<td style="text-align:right;padding:.4rem .25rem;color:var(--text-2)">${m.file_count}</td> <td style="text-align:right;padding:.4rem .25rem;color:var(--text-2)">${m.file_count}</td>
<td style="padding:.4rem .25rem;font-size:11px">${capBadges||'<span class="muted small">—</span>'}</td>
<td style="text-align:center;padding:.4rem .25rem">${m.in_config?'<span class="badge badge-ok">enabled</span>':'<span class="muted small">—</span>'}</td> <td style="text-align:center;padding:.4rem .25rem">${m.in_config?'<span class="badge badge-ok">enabled</span>':'<span class="muted small">—</span>'}</td>
<td style="padding:.4rem .25rem;text-align:right;white-space:nowrap"> <td style="padding:.4rem .25rem;text-align:right;white-space:nowrap">
${m.in_config?(loaded ${m.in_config?(loaded
...@@ -889,6 +982,7 @@ async function loadCachedModels(){ ...@@ -889,6 +982,7 @@ async function loadCachedModels(){
'<th style="text-align:left;padding:.3rem .25rem;font-weight:700">Model</th>'+ '<th style="text-align:left;padding:.3rem .25rem;font-weight:700">Model</th>'+
'<th style="text-align:right;padding:.3rem .25rem;font-weight:700">Size</th>'+ '<th style="text-align:right;padding:.3rem .25rem;font-weight:700">Size</th>'+
'<th style="text-align:right;padding:.3rem .25rem;font-weight:700">Files</th>'+ '<th style="text-align:right;padding:.3rem .25rem;font-weight:700">Files</th>'+
'<th style="text-align:left;padding:.3rem .25rem;font-weight:700">Capabilities</th>'+
'<th style="text-align:center;padding:.3rem .25rem;font-weight:700">Config</th>'+ '<th style="text-align:center;padding:.3rem .25rem;font-weight:700">Config</th>'+
'<th></th></tr></thead><tbody>'+rows.join('')+'</tbody></table>'; '<th></th></tr></thead><tbody>'+rows.join('')+'</tbody></table>';
} }
...@@ -904,9 +998,11 @@ async function loadCachedModels(){ ...@@ -904,9 +998,11 @@ async function loadCachedModels(){
_localModels.push({label:f.filename, path:f.path, cacheType:'gguf', size_gb:f.size_gb||0, _localModels.push({label:f.filename, path:f.path, cacheType:'gguf', size_gb:f.size_gb||0,
defaultType:f.model_type||'text_models', settings:f.settings||{}, in_config:f.in_config}); defaultType:f.model_type||'text_models', settings:f.settings||{}, in_config:f.in_config});
const loaded = _loadedKeys.has(f.path) || _loadedKeys.has(f.filename) || [..._loadedKeys].some(k=>k.endsWith(':'+f.path)||k.endsWith(':'+f.filename)); const loaded = _loadedKeys.has(f.path) || _loadedKeys.has(f.filename) || [..._loadedKeys].some(k=>k.endsWith(':'+f.path)||k.endsWith(':'+f.filename));
const capBadges = fmtCapabilities(f.capabilities||[]);
return `<tr style="border-top:1px solid var(--border)"> return `<tr style="border-top:1px solid var(--border)">
<td style="padding:.4rem .25rem;font-family:monospace;font-size:11px;max-width:320px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap" title="${esc(f.filename)}">${esc(f.filename)}</td> <td style="padding:.4rem .25rem;font-family:monospace;font-size:11px;max-width:320px;overflow:hidden;text-overflow:ellipsis;white-space:nowrap" title="${esc(f.filename)}">${esc(f.filename)}</td>
<td style="text-align:right;padding:.4rem .25rem;white-space:nowrap;color:var(--text-2)">${fmtGB(f.size_gb)}</td> <td style="text-align:right;padding:.4rem .25rem;white-space:nowrap;color:var(--text-2)">${fmtGB(f.size_gb)}</td>
<td style="padding:.4rem .25rem;font-size:11px">${capBadges||'<span class="muted small">—</span>'}</td>
<td style="text-align:center;padding:.4rem .25rem">${f.in_config?'<span class="badge badge-ok">enabled</span>':'<span class="muted small">—</span>'}</td> <td style="text-align:center;padding:.4rem .25rem">${f.in_config?'<span class="badge badge-ok">enabled</span>':'<span class="muted small">—</span>'}</td>
<td style="padding:.4rem .25rem;text-align:right;white-space:nowrap"> <td style="padding:.4rem .25rem;text-align:right;white-space:nowrap">
${f.in_config?(loaded ${f.in_config?(loaded
...@@ -922,6 +1018,7 @@ async function loadCachedModels(){ ...@@ -922,6 +1018,7 @@ async function loadCachedModels(){
'<thead><tr style="color:var(--text-2);font-size:10px;text-transform:uppercase;letter-spacing:.05em">'+ '<thead><tr style="color:var(--text-2);font-size:10px;text-transform:uppercase;letter-spacing:.05em">'+
'<th style="text-align:left;padding:.3rem .25rem;font-weight:700">File</th>'+ '<th style="text-align:left;padding:.3rem .25rem;font-weight:700">File</th>'+
'<th style="text-align:right;padding:.3rem .25rem;font-weight:700">Size</th>'+ '<th style="text-align:right;padding:.3rem .25rem;font-weight:700">Size</th>'+
'<th style="text-align:left;padding:.3rem .25rem;font-weight:700">Capabilities</th>'+
'<th style="text-align:center;padding:.3rem .25rem;font-weight:700">Config</th>'+ '<th style="text-align:center;padding:.3rem .25rem;font-weight:700">Config</th>'+
'<th></th></tr></thead><tbody>'+rows.join('')+'</tbody></table>'; '<th></th></tr></thead><tbody>'+rows.join('')+'</tbody></table>';
} }
...@@ -945,6 +1042,7 @@ async function refreshLocal(){ ...@@ -945,6 +1042,7 @@ async function refreshLocal(){
loadCachedModels(); loadCachedModels();
} }
loadGlobalSettings();
refreshLocal(); refreshLocal();
async function clearCacheConfirm(type){ async function clearCacheConfirm(type){
...@@ -1000,7 +1098,7 @@ function openCfgModal(idx){ ...@@ -1000,7 +1098,7 @@ function openCfgModal(idx){
document.getElementById('cfg-flash').checked = !!s.flash_attention; document.getElementById('cfg-flash').checked = !!s.flash_attention;
document.getElementById('cfg-noram').checked = !!s.no_ram; document.getElementById('cfg-noram').checked = !!s.no_ram;
document.getElementById('cfg-offload-strategy').value = s.offload_strategy || 'auto'; document.getElementById('cfg-offload-strategy').value = s.offload_strategy || 'auto';
document.getElementById('cfg-offload-dir').value = s.offload_dir || './offload'; document.getElementById('cfg-offload-dir').value = s.offload_dir || _defaultOffloadDir;
document.getElementById('cfg-sysprompt').value = s.system_prompt || ''; document.getElementById('cfg-sysprompt').value = s.system_prompt || '';
document.getElementById('cfg-parser').value = s.parser || 'auto'; document.getElementById('cfg-parser').value = s.parser || 'auto';
document.getElementById('cfg-tools').checked = !!s.tools_closer_prompt; document.getElementById('cfg-tools').checked = !!s.tools_closer_prompt;
......
...@@ -54,10 +54,15 @@ ...@@ -54,10 +54,15 @@
<label class="form-label">HuggingFace cache directory <span class="muted">(leave blank for default ~/.cache/huggingface)</span></label> <label class="form-label">HuggingFace cache directory <span class="muted">(leave blank for default ~/.cache/huggingface)</span></label>
<input type="text" id="s-hf-cache" class="form-input" placeholder="e.g. /data/models/huggingface"> <input type="text" id="s-hf-cache" class="form-input" placeholder="e.g. /data/models/huggingface">
</div> </div>
<div class="form-row" style="margin:0"> <div class="form-row">
<label class="form-label">GGUF cache directory <span class="muted">(leave blank for default ~/.cache/coderai/models)</span></label> <label class="form-label">GGUF cache directory <span class="muted">(leave blank for default ~/.cache/coderai/models)</span></label>
<input type="text" id="s-gguf-cache" class="form-input" placeholder="e.g. /data/models/gguf"> <input type="text" id="s-gguf-cache" class="form-input" placeholder="e.g. /data/models/gguf">
</div> </div>
<div class="form-row" style="margin:0">
<label class="form-label">Default offload directory <span class="muted">(default: ./offload)</span></label>
<input type="text" id="s-offload-dir" class="form-input" placeholder="./offload">
<span class="form-hint">Models will inherit this as default when configured</span>
</div>
</div> </div>
{% endblock %} {% endblock %}
...@@ -86,6 +91,7 @@ async function loadSettings(){ ...@@ -86,6 +91,7 @@ async function loadSettings(){
document.getElementById('s-cert').value = d.server?.https_cert_path ?? ''; document.getElementById('s-cert').value = d.server?.https_cert_path ?? '';
document.getElementById('s-hf-cache').value = d.models?.hf_cache_dir ?? ''; document.getElementById('s-hf-cache').value = d.models?.hf_cache_dir ?? '';
document.getElementById('s-gguf-cache').value = d.models?.gguf_cache_dir ?? ''; document.getElementById('s-gguf-cache').value = d.models?.gguf_cache_dir ?? '';
document.getElementById('s-offload-dir').value = d.offload?.directory ?? './offload';
toggleHttps(); toggleHttps();
}catch(e){ showAlert('error','Failed to load settings: '+e.message); } }catch(e){ showAlert('error','Failed to load settings: '+e.message); }
} }
...@@ -103,6 +109,9 @@ async function saveSettings(){ ...@@ -103,6 +109,9 @@ async function saveSettings(){
models:{ models:{
hf_cache_dir: strOrNull('s-hf-cache'), hf_cache_dir: strOrNull('s-hf-cache'),
gguf_cache_dir: strOrNull('s-gguf-cache'), gguf_cache_dir: strOrNull('s-gguf-cache'),
},
offload:{
directory: document.getElementById('s-offload-dir').value.trim() || './offload',
} }
}; };
try{ try{
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
# codai.api - FastAPI application module # codai.api - FastAPI application module
from .app import app from .app import app
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
""" """
FastAPI application module for codai API. FastAPI application module for codai API.
Contains the FastAPI app initialization, lifespan, and core endpoints. Contains the FastAPI app initialization, lifespan, and core endpoints.
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
""" """
Audio generation endpoints for the codai API. Audio generation endpoints for the codai API.
Supports music, sound effects, and ambient audio via MusicGen, AudioLDM2, StableAudio, etc. Supports music, sound effects, and ambient audio via MusicGen, AudioLDM2, StableAudio, etc.
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
""" """
Embeddings endpoint — OpenAI-compatible. Embeddings endpoint — OpenAI-compatible.
POST /v1/embeddings POST /v1/embeddings
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
""" """
Image generation endpoints for the codai API. Image generation endpoints for the codai API.
""" """
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
""" """
Request logging middleware for the codai API. Request logging middleware for the codai API.
""" """
import json import json
import time
from collections import deque
from fastapi import Request from fastapi import Request
# In-memory ring buffer of recent API requests (max 50)
_activity: deque = deque(maxlen=50)
def get_recent_activity():
return list(_activity)
_TRACKED_PATHS = {
"/v1/chat/completions": "chat",
"/v1/completions": "completion",
"/v1/images/generations": "image",
"/v1/audio/speech": "tts",
"/v1/audio/transcriptions": "transcription",
"/v1/embeddings": "embedding",
}
async def log_requests(request: Request, call_next): async def log_requests(request: Request, call_next):
"""Log all incoming requests for debugging.""" """Log all incoming requests for debugging."""
# Import global debug flag from state
from codai.api.state import get_global_debug from codai.api.state import get_global_debug
global_debug = get_global_debug() global_debug = get_global_debug()
if request.url.path in ["/v1/chat/completions", "/v1/completions"]: path = request.url.path
tracked = path in _TRACKED_PATHS
if tracked or path in ["/v1/chat/completions", "/v1/completions"]:
body = b"" body = b""
body_str = "" body_str = ""
model = "—"
try: try:
body = await request.body() body = await request.body()
body_str = body.decode('utf-8') body_str = body.decode('utf-8')
parsed = json.loads(body_str)
model = parsed.get("model", "—")
# In debug mode, dump the full request
if global_debug: if global_debug:
print(f"\n{'='*80}") print(f"\n{'='*80}")
print(f"=== FULL REQUEST DEBUG ===") print(f"=== FULL REQUEST DEBUG ===")
print(f"{'='*80}") print(f"Method: {request.method} URL: {request.url}")
print(f"Method: {request.method}")
print(f"URL: {request.url}")
print(f"Headers:")
for k, v in request.headers.items():
print(f" {k}: {v}")
print(f"\n--- Body ---")
# Print full body without truncation
try:
# Try to pretty-print JSON
parsed = json.loads(body_str)
print(json.dumps(parsed, indent=2)) print(json.dumps(parsed, indent=2))
except:
# If not JSON, print as-is
print(body_str)
print(f"{'='*80}\n") print(f"{'='*80}\n")
except Exception as e: except Exception as e:
if global_debug:
print(f"Error reading request body: {e}") print(f"Error reading request body: {e}")
# Call the next middleware/handler t0 = time.time()
response = await call_next(request) response = await call_next(request)
duration = time.time() - t0
if tracked:
_activity.appendleft({
"time": int(t0),
"model": model,
"type": _TRACKED_PATHS[path],
"status": response.status_code,
"duration": round(duration, 2),
})
# Log response status
if global_debug: if global_debug:
print(f"DEBUG: Response status: {response.status_code}") print(f"DEBUG: Response status: {response.status_code}")
return response return response
else: else:
# For non-chat endpoints, just pass through return await call_next(request)
response = await call_next(request) \ No newline at end of file
return response
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Global state for codai API modules.""" """Global state for codai API modules."""
from typing import Any, Optional from typing import Any, Optional
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
""" """
Text generation endpoints for the codai API. Text generation endpoints for the codai API.
""" """
...@@ -1037,6 +1053,9 @@ async def chat_completions(request: ChatCompletionRequest, http_request: Request ...@@ -1037,6 +1053,9 @@ async def chat_completions(request: ChatCompletionRequest, http_request: Request
prompt_tokens = len(raw_prompt_for_generation.split()) prompt_tokens = len(raw_prompt_for_generation.split())
completion_tokens = len(clean_text.split()) if clean_text else 0 completion_tokens = len(clean_text.split()) if clean_text else 0
# Get context size
context_size = current_manager.get_context_size()
# Step 2: Use OpenAIFormatter for final formatting # Step 2: Use OpenAIFormatter for final formatting
formatter = OpenAIFormatter(response_model_name) formatter = OpenAIFormatter(response_model_name)
try: try:
...@@ -1044,7 +1063,8 @@ async def chat_completions(request: ChatCompletionRequest, http_request: Request ...@@ -1044,7 +1063,8 @@ async def chat_completions(request: ChatCompletionRequest, http_request: Request
text=clean_text, text=clean_text,
prompt_tokens=prompt_tokens, prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens, completion_tokens=completion_tokens,
tool_calls=extracted_tool_calls tool_calls=extracted_tool_calls,
context_size=context_size
) )
except Exception as e: except Exception as e:
print(f"RAW: ERROR in formatter.format_full: {e}") print(f"RAW: ERROR in formatter.format_full: {e}")
...@@ -1135,7 +1155,8 @@ async def chat_completions(request: ChatCompletionRequest, http_request: Request ...@@ -1135,7 +1155,8 @@ async def chat_completions(request: ChatCompletionRequest, http_request: Request
"usage": { "usage": {
"prompt_tokens": prompt_tokens, "prompt_tokens": prompt_tokens,
"completion_tokens": completion_tokens, "completion_tokens": completion_tokens,
"total_tokens": prompt_tokens + completion_tokens "total_tokens": prompt_tokens + completion_tokens,
"context_size": context_size
} }
} }
...@@ -1437,6 +1458,9 @@ async def stream_chat_response( ...@@ -1437,6 +1458,9 @@ async def stream_chat_response(
prompt_tokens = len(prompt_text.split()) prompt_tokens = len(prompt_text.split())
completion_tokens = len(generated_text.split()) if generated_text else 0 completion_tokens = len(generated_text.split()) if generated_text else 0
# Get context size
context_size = current_manager.get_context_size()
# Use OpenAIFormatter for final chunk sanitization # Use OpenAIFormatter for final chunk sanitization
formatter = OpenAIFormatter(model_name) formatter = OpenAIFormatter(model_name)
usage_details = { usage_details = {
...@@ -1444,7 +1468,7 @@ async def stream_chat_response( ...@@ -1444,7 +1468,7 @@ async def stream_chat_response(
"completion_tokens": completion_tokens, "completion_tokens": completion_tokens,
"total_tokens": prompt_tokens + completion_tokens, "total_tokens": prompt_tokens + completion_tokens,
} }
final_chunk = formatter.format_litellm_chunk("", is_final=True, usage=usage_details) final_chunk = formatter.format_litellm_chunk("", is_final=True, usage=usage_details, context_size=context_size)
yield f"data: {json.dumps(final_chunk)}\n\n" yield f"data: {json.dumps(final_chunk)}\n\n"
else: else:
# Calculate token counts for usage in final chunk # Calculate token counts for usage in final chunk
...@@ -1452,6 +1476,9 @@ async def stream_chat_response( ...@@ -1452,6 +1476,9 @@ async def stream_chat_response(
prompt_tokens = len(prompt_text.split()) prompt_tokens = len(prompt_text.split())
completion_tokens = len(generated_text.split()) if generated_text else 0 completion_tokens = len(generated_text.split()) if generated_text else 0
# Get context size
context_size = current_manager.get_context_size()
# Build complete final chunk with all OpenAI fields # Build complete final chunk with all OpenAI fields
final_chunk = { final_chunk = {
"id": completion_id, "id": completion_id,
...@@ -1468,6 +1495,7 @@ async def stream_chat_response( ...@@ -1468,6 +1495,7 @@ async def stream_chat_response(
"prompt_tokens": prompt_tokens, "prompt_tokens": prompt_tokens,
"completion_tokens": completion_tokens, "completion_tokens": completion_tokens,
"total_tokens": prompt_tokens + completion_tokens, "total_tokens": prompt_tokens + completion_tokens,
"context_size": context_size,
"prompt_tokens_details": { "prompt_tokens_details": {
"cached_tokens": 0, "cached_tokens": 0,
"audio_tokens": 0, "audio_tokens": 0,
...@@ -1633,13 +1661,17 @@ async def generate_chat_response( ...@@ -1633,13 +1661,17 @@ async def generate_chat_response(
prompt_tokens = len(prompt_text.split()) prompt_tokens = len(prompt_text.split())
completion_tokens = len(generated_text.split()) if generated_text else 0 completion_tokens = len(generated_text.split()) if generated_text else 0
# Get context size
context_size = current_manager.get_context_size()
# Use OpenAIFormatter for final sanitization # Use OpenAIFormatter for final sanitization
formatter = OpenAIFormatter(model_name) formatter = OpenAIFormatter(model_name)
formatted_response = formatter.format_litellm_full( formatted_response = formatter.format_litellm_full(
text=response_message.get("content", ""), text=response_message.get("content", ""),
prompt_tokens=prompt_tokens, prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens, completion_tokens=completion_tokens,
tool_calls=response_message.get("tool_calls") tool_calls=response_message.get("tool_calls"),
context_size=context_size
) )
# Add mock reasoning stats if 'mock' is in force_reasoning_args # Add mock reasoning stats if 'mock' is in force_reasoning_args
...@@ -1765,6 +1797,7 @@ async def stream_completion_response( ...@@ -1765,6 +1797,7 @@ async def stream_completion_response(
"""Stream legacy completion response.""" """Stream legacy completion response."""
completion_id = f"cmpl-{uuid.uuid4().hex}" completion_id = f"cmpl-{uuid.uuid4().hex}"
created = int(time.time()) created = int(time.time())
generated_text = ""
try: try:
async for chunk in current_manager.generate_stream( async for chunk in current_manager.generate_stream(
...@@ -1774,6 +1807,7 @@ async def stream_completion_response( ...@@ -1774,6 +1807,7 @@ async def stream_completion_response(
top_p=top_p, top_p=top_p,
stop=stop, stop=stop,
): ):
generated_text += chunk
data = { data = {
"id": completion_id, "id": completion_id,
"object": "text_completion", "object": "text_completion",
...@@ -1788,7 +1822,37 @@ async def stream_completion_response( ...@@ -1788,7 +1822,37 @@ async def stream_completion_response(
} }
yield f"data: {json.dumps(data)}\n\n" yield f"data: {json.dumps(data)}\n\n"
yield f"data: {json.dumps({'choices': [{'finish_reason': 'stop'}]})}\n\n" # Calculate token counts
if current_manager.tokenizer:
prompt_tokens = len(current_manager.tokenizer.encode(prompt))
completion_tokens = len(current_manager.tokenizer.encode(generated_text))
else:
prompt_tokens = len(prompt.split())
completion_tokens = len(generated_text.split())
# Get context size
context_size = current_manager.get_context_size()
# Send final chunk with usage
final_chunk = {
"id": completion_id,
"object": "text_completion",
"created": created,
"model": model_name,
"choices": [{
"text": "",
"index": 0,
"logprobs": None,
"finish_reason": "stop",
}],
"usage": {
"prompt_tokens": prompt_tokens,
"completion_tokens": completion_tokens,
"total_tokens": prompt_tokens + completion_tokens,
"context_size": context_size,
},
}
yield f"data: {json.dumps(final_chunk)}\n\n"
yield "data: [DONE]\n\n" yield "data: [DONE]\n\n"
except Exception as e: except Exception as e:
print(f"Error during streaming completion: {e}") print(f"Error during streaming completion: {e}")
...@@ -1825,6 +1889,9 @@ async def generate_completion_response( ...@@ -1825,6 +1889,9 @@ async def generate_completion_response(
prompt_tokens = len(prompt.split()) prompt_tokens = len(prompt.split())
completion_tokens = len(generated_text.split()) completion_tokens = len(generated_text.split())
# Get context size
context_size = current_manager.get_context_size()
return { return {
"id": completion_id, "id": completion_id,
"object": "text_completion", "object": "text_completion",
...@@ -1840,6 +1907,7 @@ async def generate_completion_response( ...@@ -1840,6 +1907,7 @@ async def generate_completion_response(
"prompt_tokens": prompt_tokens, "prompt_tokens": prompt_tokens,
"completion_tokens": completion_tokens, "completion_tokens": completion_tokens,
"total_tokens": prompt_tokens + completion_tokens, "total_tokens": prompt_tokens + completion_tokens,
"context_size": context_size,
}, },
} }
except Exception as e: except Exception as e:
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
""" """
Audio transcription endpoint for the codai API. Audio transcription endpoint for the codai API.
""" """
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
""" """
Text-to-speech endpoints for the codai API. Text-to-speech endpoints for the codai API.
""" """
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
""" """
Video generation and manipulation endpoints for the codai API. Video generation and manipulation endpoints for the codai API.
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Backend detection and management module.""" """Backend detection and management module."""
from codai.backends.base import ModelBackend from codai.backends.base import ModelBackend
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Base classes for model backends.""" """Base classes for model backends."""
from abc import ABC, abstractmethod from abc import ABC, abstractmethod
...@@ -46,3 +62,7 @@ class ModelBackend(ABC): ...@@ -46,3 +62,7 @@ class ModelBackend(ABC):
def cleanup(self) -> None: def cleanup(self) -> None:
"""Cleanup resources.""" """Cleanup resources."""
pass pass
def get_context_size(self) -> int:
"""Return the model's context window size."""
return 2048 # Default fallback
\ No newline at end of file
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""CUDA backend using HuggingFace Transformers.""" """CUDA backend using HuggingFace Transformers."""
import os import os
...@@ -868,3 +884,13 @@ class NvidiaBackend(ModelBackend): ...@@ -868,3 +884,13 @@ class NvidiaBackend(ModelBackend):
self.tokenizer = None self.tokenizer = None
if torch.cuda.is_available(): if torch.cuda.is_available():
torch.cuda.empty_cache() torch.cuda.empty_cache()
def get_context_size(self) -> int:
"""Return the model's context window size."""
if self.model is not None and hasattr(self.model, 'config'):
config = self.model.config
# Try different attribute names used by different models
for attr in ['max_position_embeddings', 'n_positions', 'max_seq_length', 'seq_length']:
if hasattr(config, attr):
return getattr(config, attr)
return 2048 # Default fallback
\ No newline at end of file
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
# AI.PROMPT: Add Vulkan backend support for AMD GPUs using llama-cpp-python # AI.PROMPT: Add Vulkan backend support for AMD GPUs using llama-cpp-python
# This backend handles GGUF models on AMD GPUs via Vulkan # This backend handles GGUF models on AMD GPUs via Vulkan
...@@ -932,3 +948,7 @@ class VulkanBackend(ModelBackend): ...@@ -932,3 +948,7 @@ class VulkanBackend(ModelBackend):
def cleanup(self) -> None: def cleanup(self) -> None:
"""Cleanup resources.""" """Cleanup resources."""
self.unload_model() self.unload_model()
def get_context_size(self) -> int:
"""Return the model's context window size."""
return self.n_ctx
\ No newline at end of file
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Command-line argument parsing for codai server.""" """Command-line argument parsing for codai server."""
import argparse import argparse
import json import json
...@@ -209,4 +225,3 @@ configuration directory (--config DIR, default: ~/.coderai/). Key files: ...@@ -209,4 +225,3 @@ configuration directory (--config DIR, default: ~/.coderai/). Key files:
help="List available Vulkan GPU devices and exit", help="List available Vulkan GPU devices and exit",
) )
return parser.parse_args() return parser.parse_args()
\ No newline at end of file
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Configuration management for coderai.""" """Configuration management for coderai."""
import json import json
import os import os
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Main entry point for codai server.""" """Main entry point for codai server."""
import sys import sys
import os import os
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
# codai.models - Model parsing and templates # codai.models - Model parsing and templates
from .manager import ( from .manager import (
ModelManager, ModelManager,
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
""" """
Model Cache - Unified model loading, caching, downloading, and management. Model Cache - Unified model loading, caching, downloading, and management.
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Model capabilities module.""" """Model capabilities module."""
from dataclasses import dataclass from dataclasses import dataclass
...@@ -61,6 +77,7 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities: ...@@ -61,6 +77,7 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities:
""" """
Detect model capabilities from the model name/ID. Detect model capabilities from the model name/ID.
Heuristic only — actual capabilities depend on the checkpoint. Heuristic only — actual capabilities depend on the checkpoint.
Returns all detected capabilities (multimodal models may have multiple).
""" """
caps = ModelCapabilities() caps = ModelCapabilities()
if not model_name: if not model_name:
...@@ -74,10 +91,12 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities: ...@@ -74,10 +91,12 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities:
'animatediff', 'text2video', 'modelscope-t2v', 'animatediff', 'text2video', 'modelscope-t2v',
'zeroscope', 'lavie']): 'zeroscope', 'lavie']):
caps.video_generation = True caps.video_generation = True
caps.text_generation = True # T2V models also do text
return caps return caps
if any(x in n for x in ['wan2.1-t2v', 'wan-t2v']): if any(x in n for x in ['wan2.1-t2v', 'wan-t2v']):
caps.video_generation = True caps.video_generation = True
caps.text_generation = True
return caps return caps
# Image-to-video # Image-to-video
...@@ -86,12 +105,17 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities: ...@@ -86,12 +105,17 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities:
'wan2.1-i2v', 'wan-i2v', 'img2vid', 'wan2.1-i2v', 'wan-i2v', 'img2vid',
'image2video', 'motionctrl']): 'image2video', 'motionctrl']):
caps.image_to_video = True caps.image_to_video = True
caps.image_to_text = True # I2V models process images
return caps return caps
# Wan generic (detect sub-variant) # Wan generic (detect sub-variant)
if 'wan' in n and ('video' in n or 'diffuser' in n): if 'wan' in n and ('video' in n or 'diffuser' in n):
caps.image_to_video = True if 'i2v' in n else False if 'i2v' in n:
caps.video_generation = True if 'i2v' not in n else False caps.image_to_video = True
caps.image_to_text = True
else:
caps.video_generation = True
caps.text_generation = True
return caps return caps
# Video interpolation # Video interpolation
...@@ -115,6 +139,7 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities: ...@@ -115,6 +139,7 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities:
if any(x in n for x in ['musicgen', 'audiogen', 'audioldm', 'stable-audio', if any(x in n for x in ['musicgen', 'audiogen', 'audioldm', 'stable-audio',
'mustango', 'noise2music', 'jukebox', 'audiocraft']): 'mustango', 'noise2music', 'jukebox', 'audiocraft']):
caps.audio_generation = True caps.audio_generation = True
caps.text_generation = True # T2A models process text
return caps return caps
if any(x in n for x in ['demucs', 'spleeter', 'asteroid', 'open-unmix']): if any(x in n for x in ['demucs', 'spleeter', 'asteroid', 'open-unmix']):
...@@ -130,11 +155,14 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities: ...@@ -130,11 +155,14 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities:
if any(x in n for x in ['kokoro', 'xtts', 'bark', 'tortoise', if any(x in n for x in ['kokoro', 'xtts', 'bark', 'tortoise',
'speecht5', 'matcha-tts', 'voicebox']): 'speecht5', 'matcha-tts', 'voicebox']):
caps.text_to_speech = True caps.text_to_speech = True
caps.text_generation = True # TTS models process text
return caps return caps
# Lip sync / dubbing # Lip sync / dubbing
if any(x in n for x in ['wav2lip', 'sadtalker', 'dinet', 'videoretalking']): if any(x in n for x in ['wav2lip', 'sadtalker', 'dinet', 'videoretalking']):
caps.lip_sync = True caps.lip_sync = True
caps.audio_generation = True
caps.video_generation = True
return caps return caps
# ── Image: generation ──────────────────────────────────────────────────── # ── Image: generation ────────────────────────────────────────────────────
...@@ -142,11 +170,13 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities: ...@@ -142,11 +170,13 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities:
caps.inpainting = True caps.inpainting = True
caps.image_generation = True caps.image_generation = True
caps.image_to_image = True caps.image_to_image = True
caps.text_generation = True # T2I models process text
return caps return caps
if 'controlnet' in n: if 'controlnet' in n:
caps.controlnet = True caps.controlnet = True
caps.image_generation = True caps.image_generation = True
caps.text_generation = True
return caps return caps
if any(x in n for x in ['stable-diffusion', 'sd15', 'sdxl', 'sd-xl', if any(x in n for x in ['stable-diffusion', 'sd15', 'sdxl', 'sd-xl',
...@@ -156,31 +186,37 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities: ...@@ -156,31 +186,37 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities:
caps.image_generation = True caps.image_generation = True
caps.image_to_image = True caps.image_to_image = True
caps.inpainting = True # most SD/SDXL/Flux support inpainting variant caps.inpainting = True # most SD/SDXL/Flux support inpainting variant
caps.text_generation = True # T2I models process text
return caps return caps
# ── Image: analysis / processing ───────────────────────────────────────── # ── Image: analysis / processing ─────────────────────────────────────────
if any(x in n for x in ['midas', 'dpt-depth', 'dpt-large', 'zoe-depth', if any(x in n for x in ['midas', 'dpt-depth', 'dpt-large', 'zoe-depth',
'depth-anything', 'marigold']): 'depth-anything', 'marigold']):
caps.depth_estimation = True caps.depth_estimation = True
caps.image_to_text = True # Image analysis models process images
return caps return caps
if any(x in n for x in ['sam2', 'sam-', '-sam', 'segment-anything', if any(x in n for x in ['sam2', 'sam-', '-sam', 'segment-anything',
'mask-rcnn', 'fastsam']): 'mask-rcnn', 'fastsam']):
caps.image_segmentation = True caps.image_segmentation = True
caps.image_to_text = True
return caps return caps
if any(x in n for x in ['real-esrgan', 'esrgan', 'swinir', 'edsr', if any(x in n for x in ['real-esrgan', 'esrgan', 'swinir', 'edsr',
'bsrgan', 'hat-', 'dat-']): 'bsrgan', 'hat-', 'dat-']):
caps.image_upscaling = True caps.image_upscaling = True
caps.image_to_image = True
return caps return caps
if any(x in n for x in ['codeformer', 'gfpgan', 'restoreformer']): if any(x in n for x in ['codeformer', 'gfpgan', 'restoreformer']):
caps.face_restoration = True caps.face_restoration = True
caps.image_upscaling = True caps.image_upscaling = True
caps.image_to_image = True
return caps return caps
if any(x in n for x in ['yolo', 'detr', 'owlvit', 'rtdetr', 'dino']): if any(x in n for x in ['yolo', 'detr', 'owlvit', 'rtdetr', 'dino']):
caps.object_detection = True caps.object_detection = True
caps.image_to_text = True
return caps return caps
# ── Vision / multimodal LLMs ───────────────────────────────────────────── # ── Vision / multimodal LLMs ─────────────────────────────────────────────
...@@ -197,6 +233,7 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities: ...@@ -197,6 +233,7 @@ def detect_model_capabilities(model_name: str) -> ModelCapabilities:
'sentence-transformer', 'nomic-embed', 'sentence-transformer', 'nomic-embed',
'instructor-', 'gte-', 'jina-embed']): 'instructor-', 'gte-', 'jina-embed']):
caps.embeddings = True caps.embeddings = True
caps.text_generation = True # Embedding models process text
return caps return caps
# ── GGUF quantised text models ─────────────────────────────────────────── # ── GGUF quantised text models ───────────────────────────────────────────
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Grammar loading utilities for grammar-guided generation.""" """Grammar loading utilities for grammar-guided generation."""
import os import os
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Model manager module - contains ModelManager, WhisperServerManager, and MultiModelManager classes.""" """Model manager module - contains ModelManager, WhisperServerManager, and MultiModelManager classes."""
from typing import Optional, Dict, Any, List from typing import Optional, Dict, Any, List
...@@ -212,6 +228,12 @@ class ModelManager: ...@@ -212,6 +228,12 @@ class ModelManager:
return self.backend.tokenizer return self.backend.tokenizer
return None return None
def get_context_size(self) -> int:
"""Get the model's context window size."""
if self.backend is not None:
return self.backend.get_context_size()
return 2048 # Default fallback
def cleanup(self): def cleanup(self):
if self.backend is not None: if self.backend is not None:
self.backend.cleanup() self.backend.cleanup()
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
""" """
Model Parser Dispatcher - Multi-Model Tool Call Parsing Model Parser Dispatcher - Multi-Model Tool Call Parsing
...@@ -1173,10 +1189,15 @@ class OpenAIFormatter: ...@@ -1173,10 +1189,15 @@ class OpenAIFormatter:
self.model_name = model_name self.model_name = model_name
self.id = f"chatcmpl-{uuid.uuid4()}" self.id = f"chatcmpl-{uuid.uuid4()}"
def format_full(self, text, prompt_tokens, completion_tokens, tool_calls=None, reasoning=None): def format_full(self, text, prompt_tokens, completion_tokens, tool_calls=None, reasoning=None, context_size=None):
"""Standard Response (Non-Streaming)""" """Standard Response (Non-Streaming)"""
if LITELLM_AVAILABLE and all([ModelResponse, Choices, Message, Usage]): if LITELLM_AVAILABLE and all([ModelResponse, Choices, Message, Usage]):
try: try:
usage_dict = {
"prompt_tokens": prompt_tokens,
"completion_tokens": completion_tokens,
"total_tokens": prompt_tokens + completion_tokens
}
return ModelResponse( return ModelResponse(
id=self.id, id=self.id,
model=self.model_name, model=self.model_name,
...@@ -1187,11 +1208,7 @@ class OpenAIFormatter: ...@@ -1187,11 +1208,7 @@ class OpenAIFormatter:
index=0, index=0,
message=Message(content=text if not tool_calls else None, role="assistant", tool_calls=tool_calls) message=Message(content=text if not tool_calls else None, role="assistant", tool_calls=tool_calls)
)], )],
usage=Usage( usage=Usage(**usage_dict)
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
total_tokens=prompt_tokens + completion_tokens
)
).model_dump() ).model_dump()
except Exception as e: except Exception as e:
print(f"DEBUG formatter: litellm fallback failed: {e}") print(f"DEBUG formatter: litellm fallback failed: {e}")
...@@ -1212,24 +1229,28 @@ class OpenAIFormatter: ...@@ -1212,24 +1229,28 @@ class OpenAIFormatter:
"finish_reason": "tool_calls" if tool_calls else "stop", "finish_reason": "tool_calls" if tool_calls else "stop",
} }
usage = {
"prompt_tokens": prompt_tokens,
"completion_tokens": completion_tokens,
"total_tokens": prompt_tokens + completion_tokens,
}
if context_size is not None:
usage["context_size"] = context_size
return { return {
"id": self.id, "id": self.id,
"object": "chat.completion", "object": "chat.completion",
"created": int(time.time()), "created": int(time.time()),
"model": self.model_name, "model": self.model_name,
"choices": [choice], "choices": [choice],
"usage": { "usage": usage,
"prompt_tokens": prompt_tokens,
"completion_tokens": completion_tokens,
"total_tokens": prompt_tokens + completion_tokens,
},
"provider": { "provider": {
"provider_name": "coderai", "provider_name": "coderai",
"provider_id": "coderai", "provider_id": "coderai",
}, },
} }
def format_chunk(self, delta_text, is_final=False, usage=None): def format_chunk(self, delta_text, is_final=False, usage=None, context_size=None):
"""Streaming Chunk (Used in a Generator)""" """Streaming Chunk (Used in a Generator)"""
if LITELLM_AVAILABLE and all([ChatCompletionChunk, StreamingChoices, Delta, (Usage if usage else True)]): if LITELLM_AVAILABLE and all([ChatCompletionChunk, StreamingChoices, Delta, (Usage if usage else True)]):
try: try:
...@@ -1270,21 +1291,23 @@ class OpenAIFormatter: ...@@ -1270,21 +1291,23 @@ class OpenAIFormatter:
if usage and is_final: if usage and is_final:
chunk["usage"] = usage chunk["usage"] = usage
if context_size is not None:
chunk["usage"]["context_size"] = context_size
return chunk return chunk
def format_final_chunk(self, usage: dict = None) -> dict: def format_final_chunk(self, usage: dict = None, context_size: int = None) -> dict:
"""Format the final streaming chunk with usage information.""" """Format the final streaming chunk with usage information."""
return self.format_chunk("", is_final=True, usage=usage) return self.format_chunk("", is_final=True, usage=usage, context_size=context_size)
# Backward compatibility methods # Backward compatibility methods
def format_litellm_full(self, text: str, prompt_tokens: int, completion_tokens: int, tool_calls=None) -> dict: def format_litellm_full(self, text: str, prompt_tokens: int, completion_tokens: int, tool_calls=None, context_size=None) -> dict:
"""Backward compatibility method - calls format_full.""" """Backward compatibility method - calls format_full."""
return self.format_full(text, prompt_tokens, completion_tokens, tool_calls) return self.format_full(text, prompt_tokens, completion_tokens, tool_calls, context_size=context_size)
def format_litellm_chunk(self, delta_text: str, is_final: bool = False, usage: dict = None) -> dict: def format_litellm_chunk(self, delta_text: str, is_final: bool = False, usage: dict = None, context_size: int = None) -> dict:
"""Backward compatibility method - calls format_chunk.""" """Backward compatibility method - calls format_chunk."""
return self.format_chunk(delta_text, is_final, usage) return self.format_chunk(delta_text, is_final, usage, context_size)
# ============================================================================= # =============================================================================
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
""" """
Agentic Template Manager for forcing reasoning in LLM agents. Agentic Template Manager for forcing reasoning in LLM agents.
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Utility functions for model handling.""" """Utility functions for model handling."""
from typing import Optional, Any from typing import Optional, Any
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Pydantic models for audio generation API.""" """Pydantic models for audio generation API."""
from typing import Dict, List, Optional from typing import Dict, List, Optional
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Pydantic models for embeddings API.""" """Pydantic models for embeddings API."""
from typing import Dict, List, Optional, Union from typing import Dict, List, Optional, Union
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Pydantic models for image generation API.""" """Pydantic models for image generation API."""
from typing import Dict, List, Optional from typing import Dict, List, Optional
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Pydantic models for API.""" """Pydantic models for API."""
import time import time
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Pydantic models for transcription API.""" """Pydantic models for transcription API."""
from typing import List, Optional from typing import List, Optional
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Pydantic models for video generation API.""" """Pydantic models for video generation API."""
from typing import Dict, List, Optional from typing import Dict, List, Optional
......
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
"""Queue manager module - manages request queues for model loading notifications.""" """Queue manager module - manages request queues for model loading notifications."""
from typing import Dict, Optional from typing import Dict, Optional
......
#!/usr/bin/env python3 #!/usr/bin/env python3
# CoderAI - OpenAI-compatible API server
# Copyright (C) 2026 Stefy Lanza <stefy@nexlab.net>
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program. If not, see <https://www.gnu.org/licenses/>.
""" """
OpenAI-compatible API server for HuggingFace models (NVIDIA) and GGUF models (Vulkan). OpenAI-compatible API server for HuggingFace models (NVIDIA) and GGUF models (Vulkan).
Supports CUDA (NVIDIA) and Vulkan (AMD) GPU backends, memory-aware model loading, Supports CUDA (NVIDIA) and Vulkan (AMD) GPU backends, memory-aware model loading,
......
# README Update - 2026-05-05
## Summary
Updated the README.md to reflect the current configuration-based architecture implemented in the 2026-05-03 refactoring. The README was outdated and still documented the old CLI-heavy approach with numerous command-line flags.
## Key Changes
### 1. Updated Feature Section
- Reorganized into three subsections: Core Capabilities, GPU Backend Support, Advanced Features
- Emphasized the web admin dashboard and configuration-based approach
- Highlighted multi-modal support (text, image, audio, TTS)
- Added per-model configuration as a key feature
### 2. Installation Section
- Updated build script examples to show `./build.sh all` option
- Clarified that `all` installs support for all backends
- Maintained backward compatibility with `nvidia` and `vulkan` options
### 3. Usage Section - Major Overhaul
- **Removed**: All old CLI examples with `--model`, `--backend`, `--load-in-4bit`, etc.
- **Added**:
- Quick start guide with simple `python coderai` command
- Access points (Admin Dashboard, Chat Interface, API, Docs)
- First login credentials
- Configuration files overview
- Updated command-line options (only `--config`, `--debug`, `--dump`, model management, and utility flags)
### 4. Configuration Section - New Structure
- Added comprehensive configuration file examples:
- `config.json` - Server, backend, and global settings
- `models.json` - Model registry with per-model configurations
- `auth.json` - Users, API tokens, and sessions
- Added "Managing Configuration" subsection:
- Via Web Dashboard (recommended)
- Via Configuration Files (manual editing)
- Added "Per-Model Configuration" with detailed settings for each backend
- Added "Backend Selection" and "Model Loading Modes" subsections
### 5. Backend-Specific Setup - Restructured
- **NVIDIA (CUDA)**: Removed CLI examples, added `models.json` configuration example
- **AMD and Intel (Vulkan)**: Removed CLI examples, added `models.json` and `config.json` configuration examples
- **CPU-Only**: Updated to show configuration-based approach
- **Low VRAM Configuration**: Changed from CLI flags to config file examples (global and per-model)
- **Multi-GPU with Vulkan**: Updated to use `config.json` settings instead of CLI flags
### 6. Removed Sections
- Removed "Reply Filters" section (not in current CLI)
- Removed "HuggingFace Chat Template" section (not in current CLI)
- Removed "Backend Selection" CLI examples
- Removed "Model Formats by Backend" CLI examples
- Removed all "Examples" subsection with CLI commands
### 7. Maintained Sections
- API Documentation (unchanged - still valid)
- Model Recommendations (unchanged - still valid)
- Troubleshooting (unchanged - examples are still helpful)
- License, Contributing, Acknowledgments (unchanged)
## Architecture Documented
### Before (Old README)
```
Command Line (many flags) → main.py → FastAPI API
```
### After (Updated README)
```
~/.coderai/
├── config.json # Server, backend, global settings
├── models.json # Per-model configs
├── auth.json # Users, tokens, sessions
└── secret_key # Session signing key
ConfigManager → main.py → FastAPI (API + Admin UI + Chat)
```
## User Experience Improvements
1. **Simpler Getting Started**: Users now just run `python coderai` instead of memorizing complex CLI flags
2. **Web-Based Management**: All configuration through the admin dashboard at `http://localhost:8000/admin`
3. **Persistent Configuration**: Settings saved in JSON files, no need to remember CLI arguments
4. **Per-Model Settings**: Each model can have its own configuration (GPU layers, quantization, context size)
5. **Better Documentation**: Clear separation between installation, usage, and configuration
## Files Modified
- `/storage/coderai/README.md` - Complete overhaul (~1009 lines)
## Validation
- ✅ All sections updated to reflect configuration-based architecture
- ✅ Removed outdated CLI examples
- ✅ Added comprehensive configuration examples
- ✅ Maintained valid troubleshooting and model recommendation sections
- ✅ Preserved license and acknowledgments
- ✅ Structure is clear and easy to navigate
## Next Steps
Users should now:
1. Run `./build.sh all` to install
2. Run `python coderai` to start
3. Visit `http://localhost:8000/admin` to configure
4. Use the web dashboard for all model and settings management
No more memorizing CLI flags!
Markdown is supported
0% or
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment