# Multimodal Model Capability Indicators - Implementation Summary
## Overview
Added comprehensive multimodal capability detection and display throughout CoderAI's UI, making it easy to identify models that support multiple modalities (text, image, video, audio) before downloading and when browsing the local cache.
An OpenAI-compatible API server supporting multiple GPU backends: NVIDIA (CUDA), AMD (Vulkan), and Intel (Vulkan). Uses HuggingFace Transformers for NVIDIA GPUs and llama-cpp-python with Vulkan for AMD/Intel GPUs.
An OpenAI-compatible API server with web administration dashboard, supporting multiple GPU backends: NVIDIA (CUDA), AMD (Vulkan), and Intel (Vulkan). Configuration-driven architecture with per-model settings and multi-modal support (text, image, audio, TTS).
## Features
-**Multi-Backend Support**:
- NVIDIA (CUDA) via PyTorch + Transformers
- AMD GPUs via llama-cpp-python + Vulkan
- Intel GPUs (iGPU/Arc) via llama-cpp-python + Vulkan
### Core Capabilities
-**OpenAI-Compatible API**: Drop-in replacement for OpenAI's API endpoints
-**Memory-Aware Model Loading**: Automatically determines optimal loading strategy based on available VRAM and RAM (NVIDIA)
-**Sequential Offloading**: Smart offload from VRAM → RAM → Disk when needed (NVIDIA)
-**Multi-GPU Support**: Automatic distribution across multiple CUDA devices (NVIDIA)
-**GPU Auto-Detection**: Automatically detects available backends
-**Quantization Support**: 4-bit and 8-bit quantization via bitsandbytes (NVIDIA) or built-in GGUF quantization (Vulkan)
-**Streaming Responses**: Server-sent events for real-time token generation
-**Tool Calling**: Support for function calling and tool use
-**Multiple Endpoints**: `/v1/chat/completions`, `/v1/completions`, and `/v1/models`
-**Web Admin Dashboard**: Modern UI for model management, user authentication, and API tokens
-**Configuration-Based**: JSON config files for all settings - no complex CLI arguments
-**Multi-Modal Support**: Text generation, image generation, audio transcription, text-to-speech
-**Per-Model Configuration**: Individual settings for each model (GPU layers, quantization, context size)
-**On-Demand Loading**: Models load automatically when requested, unload when idle
### GPU Backend Support
-**NVIDIA (CUDA)**: PyTorch + Transformers for HuggingFace models
-**AMD GPUs**: llama-cpp-python + Vulkan for GGUF models
-**Intel GPUs**: iGPU/Arc support via Vulkan
-**Auto-Detection**: Automatically selects best available backend
-**Multi-GPU**: Automatic distribution across multiple devices
### Advanced Features
-**Memory Management**: Smart VRAM → RAM → Disk offloading (NVIDIA)
-**Quantization**: 4-bit/8-bit via bitsandbytes (NVIDIA) or GGUF quantization (Vulkan)
-**Flash Attention 2**: Optional faster inference for supported NVIDIA GPUs
-**Streaming**: Server-sent events for real-time token generation
-**Tool Calling**: Function calling and tool use support
-**Authentication**: Session-based auth with API token support
## Installation
...
...
@@ -44,19 +52,20 @@ The easiest way to install is using the provided build script:
git clone git@git.nexlab.net:nexlab/coderai.git
cd coderai
# For NVIDIA GPUs (default)
./build.sh nvidia
# Install all backends (recommended)
./build.sh all
# For AMD or Intel GPUs with Vulkan support
./build.sh vulkan
# Or install specific backend:
./build.sh nvidia # NVIDIA GPUs only
./build.sh vulkan # AMD/Intel GPUs only
```
**Note**: The `vulkan` option works for both AMD and Intel GPUs.
**Note**: The `all` option installs support for all backends, allowing you to switch between them via configuration. The `vulkan` option works for both AMD and Intel GPUs.
The build script will:
- Create a virtual environment
- Install the appropriate dependencies for your GPU
CoderAI uses JSON configuration files stored in `~/.coderai/` (or custom directory via `--config`):
# Apply all filters to specific model
coderai --reply-filters text:llama-3.1:all
```
**Filter Syntax Reference:**
| Syntax | Applies To |
|--------|------------|
| `all` | All models, all filters |
| `malformed` | All models, malformed filter |
| `tool_calls` | All models, tool_calls filter |
| `text:malformed` | All text models, malformed filter |
| `image:tool_calls` | All image models, tool_calls filter |
| `text:model_name:malformed` | Specific text model, malformed filter |
| `image:model_name:tool_calls` | Specific image model, tool_calls filter |
### HuggingFace Chat Template
The `--hf-chat-template` option enables using HuggingFace's `apply_chat_template` from the transformers library for GGUF models instead of llama.cpp's built-in chat template handling. This provides more consistent chat template formatting that matches HuggingFace models.
**Requirements:**
-`transformers` library must be installed
- The model must be available on HuggingFace Hub or have a `tokenizer_config.json` in the same directory as the GGUF file
**Usage:**
```bash
# Auto-detect and use HuggingFace chat template for all models
coderai --hf-chat-template auto --model llama-3.1-8b-instruct-q4_k_m.gguf
# Auto-detect for all text models
coderai --hf-chat-template text --model llama-3.1-8b-instruct-q4_k_m.gguf
### Using Vulkan with Multiple GPUs (NVIDIA + AMD)
If your system has both NVIDIA and AMD GPUs, llama.cpp's Vulkan backend will automatically distribute layers across all visible GPUs for performance. To force Vulkan to use **only** the AMD GPU and prevent VRAM allocation on the NVIDIA GPU:
If your system has both NVIDIA and AMD GPUs, llama.cpp's Vulkan backend will automatically distribute layers across all visible GPUs for performance. To force Vulkan to use **only** the AMD GPU and prevent VRAM allocation on the NVIDIA GPU, configure in `config.json`:
**Method 1: Use `--vulkan-single-gpu` flag (Recommended)**
```bash
# Force all layers onto the specified GPU device only
# Or hide NVIDIA GPU from CUDA (prevents any CUDA usage)
CUDA_VISIBLE_DEVICES="" python coderai
```
**Understanding the Issue:**
When you have multiple Vulkan-compatible GPUs, llama.cpp automatically distributes model layers across them (shown in logs as "layer X assigned to device VulkanY"). The `--vulkan-single-gpu` flag prevents this by using the `tensor_split` parameter with a value of `[0.0, 1.0]` (or similar depending on device count), which tells llama.cpp to put 0% of layers on some GPUs and 100% on the selected GPU.
When you have multiple Vulkan-compatible GPUs, llama.cpp automatically distributes model layers across them (shown in logs as "layer X assigned to device VulkanY"). The `single_gpu: true` setting prevents this by using the `tensor_split` parameter with a value of `[0.0, 1.0]` (or similar depending on device count), which tells llama.cpp to put 0% of layers on some GPUs and 100% on the selected GPU.
**Notes:**
- The `--vulkan-device` argument maps to `main_gpu` in llama-cpp-python
- The `--vulkan-single-gpu` flag builds a `tensor_split` array to force single GPU usage
- The `device_id` setting maps to `main_gpu` in llama-cpp-python
- The `single_gpu` flag builds a `tensor_split` array to force single GPU usage
- Vulkan enumerates all GPUs in your system, so device IDs may differ from CUDA device IDs
- The `vulkaninfo` command shows all GPUs visible to Vulkan
...
...
@@ -608,7 +710,7 @@ Multiple GPUs are automatically detected and utilized. The model will be distrib
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Run - model will be distributed across all visible GPUs
Updated the README.md to reflect the current configuration-based architecture implemented in the 2026-05-03 refactoring. The README was outdated and still documented the old CLI-heavy approach with numerous command-line flags.
## Key Changes
### 1. Updated Feature Section
- Reorganized into three subsections: Core Capabilities, GPU Backend Support, Advanced Features
- Emphasized the web admin dashboard and configuration-based approach
- Highlighted multi-modal support (text, image, audio, TTS)
- Added per-model configuration as a key feature
### 2. Installation Section
- Updated build script examples to show `./build.sh all` option
- Clarified that `all` installs support for all backends
- Maintained backward compatibility with `nvidia` and `vulkan` options
### 3. Usage Section - Major Overhaul
-**Removed**: All old CLI examples with `--model`, `--backend`, `--load-in-4bit`, etc.
-**Added**:
- Quick start guide with simple `python coderai` command