Add Vulkan support for AMD GPUs alongside NVIDIA/CUDA

- Add build.sh script with nvidia/vulkan arguments (default: nvidia)
- Create backend abstraction: ModelBackend base class
- Implement NvidiaBackend using HuggingFace Transformers
- Implement VulkanBackend using llama-cpp-python with GGUF models
- Add separate requirements files for nvidia and vulkan backends
- Add --backend argument with auto/nvidia/vulkan options
- Add Vulkan-specific options: --n-gpu-layers, --n-ctx
- Make procname import optional
- Update README with comprehensive Vulkan usage instructions
- Add Vulkan troubleshooting section
- Add GGUF model recommendations

The application now supports:
- NVIDIA GPUs via PyTorch/Transformers (HuggingFace models)
- AMD GPUs via llama-cpp-python/Vulkan (GGUF models)
parent ae1d0e38
# CoderAI # CoderAI
An OpenAI-compatible API server for HuggingFace models with intelligent memory management, GPU auto-detection, and advanced features like tool calling and streaming. An OpenAI-compatible API server supporting both NVIDIA (CUDA) and AMD (Vulkan) GPUs. Uses HuggingFace Transformers for NVIDIA GPUs and llama-cpp-python with Vulkan for AMD GPUs.
## Features ## Features
- **Dual Backend Support**: NVIDIA (CUDA) via PyTorch + Transformers, AMD (Vulkan) via llama-cpp-python
- **OpenAI-Compatible API**: Drop-in replacement for OpenAI's API endpoints - **OpenAI-Compatible API**: Drop-in replacement for OpenAI's API endpoints
- **Memory-Aware Model Loading**: Automatically determines optimal loading strategy based on available VRAM and RAM - **Memory-Aware Model Loading**: Automatically determines optimal loading strategy based on available VRAM and RAM (NVIDIA)
- **Sequential Offloading**: Smart offload from VRAM → RAM → Disk when needed - **Sequential Offloading**: Smart offload from VRAM → RAM → Disk when needed (NVIDIA)
- **Multi-GPU Support**: Automatic distribution across multiple CUDA/ROCm devices - **Multi-GPU Support**: Automatic distribution across multiple CUDA devices (NVIDIA)
- **GPU Auto-Detection**: Automatically detects CUDA (NVIDIA) or ROCm (AMD) GPUs - **GPU Auto-Detection**: Automatically detects available backends
- **Quantization Support**: 4-bit and 8-bit quantization via bitsandbytes for reduced memory usage - **Quantization Support**: 4-bit and 8-bit quantization via bitsandbytes (NVIDIA) or built-in GGUF quantization (Vulkan)
- **Flash Attention 2**: Optional faster attention implementation for supported GPUs - **Flash Attention 2**: Optional faster attention implementation for supported NVIDIA GPUs
- **Streaming Responses**: Server-sent events for real-time token generation - **Streaming Responses**: Server-sent events for real-time token generation
- **Tool Calling**: Support for function calling and tool use - **Tool Calling**: Support for function calling and tool use
- **Multiple Endpoints**: `/v1/chat/completions`, `/v1/completions`, and `/v1/models` - **Multiple Endpoints**: `/v1/chat/completions`, `/v1/completions`, and `/v1/models`
@@ -21,68 +22,81 @@ An OpenAI-compatible API server for HuggingFace models with intelligent memory m
- Python 3.8+ - Python 3.8+
- For NVIDIA GPUs: CUDA toolkit (11.8+ recommended) - For NVIDIA GPUs: CUDA toolkit (11.8+ recommended)
- For AMD GPUs: ROCm (5.6+ recommended, 6.0+ preferred) - For AMD GPUs (Vulkan): Vulkan drivers and SDK
- For CPU-only: No additional requirements - For CPU-only: No additional requirements
### Basic Installation ### Quick Install with Build Script
The easiest way to install is using the provided build script:
```bash ```bash
# Clone the repository # Clone the repository
git clone git@git.nexlab.net:nexlab/coderai.git git clone git@git.nexlab.net:nexlab/coderai.git
cd coderai cd coderai
# Create virtual environment (recommended) # For NVIDIA GPUs (default)
python -m venv venv ./build.sh nvidia
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install base requirements # For AMD GPUs with Vulkan support
pip install -r requirements.txt ./build.sh vulkan
``` ```
### Platform-Specific PyTorch Installation The build script will:
- Create a virtual environment
- Install the appropriate dependencies for your GPU
- Set up the correct backend
PyTorch installation varies by platform. Uncomment the appropriate section in [`requirements.txt`](requirements.txt) or install manually: ### Manual Installation
> **⚠️ WARNING: Shell Redirection Issue** If you prefer manual installation:
> When using `>=` in pip commands, always use **quotes** around the package specifier!
> Without quotes, the shell interprets `>` as output redirection.
>
> ❌ Wrong: `pip install torch>=2.0.0` (creates file named "=2.0.0")
> ✅ Correct: `pip install "torch>=2.0.0"` (with quotes)
> ✅ Also correct: `pip install torch==2.0.0` (exact version, no >=)
#### NVIDIA (CUDA)
```bash ```bash
# For CUDA 11.8 # Create virtual environment
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/cu118 python -m venv venv
source venv/bin/activate
# For CUDA 12.1 # For NVIDIA GPUs
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/cu121 pip install torch torchvision torchaudio
pip install -r requirements-nvidia.txt
# For CUDA 12.4 (latest) # For AMD GPUs with Vulkan
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python --no-cache-dir
pip install -r requirements-vulkan.txt
``` ```
#### AMD (ROCm) ### Platform-Specific Requirements
```bash #### NVIDIA (CUDA)
# For ROCm 6.0 (recommended for newer AMD GPUs)
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/rocm6.0
# For ROCm 5.6 (for older AMD GPUs) Requires:
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/rocm5.6 - NVIDIA GPU with CUDA support
``` - CUDA toolkit (11.8+ or 12.1+)
- PyTorch with CUDA
Models: HuggingFace format (safetensors/pytorch)
> **Note**: ROCm 5.4.2 is deprecated. Use ROCm 5.6 or 6.0 for better compatibility. #### AMD (Vulkan)
> Check available versions at: https://pytorch.org/get-started/locally/
#### CPU Only Requires:
- AMD GPU with Vulkan support (RX 400 series and newer)
- Vulkan drivers and SDK
**Install Vulkan drivers:**
```bash ```bash
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/cpu # Debian/Ubuntu
sudo apt install libvulkan-dev vulkan-tools mesa-vulkan-drivers
# Fedora
sudo dnf install vulkan-loader-devel vulkan-tools mesa-vulkan-drivers
# Arch Linux
sudo pacman -S vulkan-headers vulkan-icd-loader vulkan-radeon
``` ```
Models: GGUF format (from HuggingFace or local files)
**Note**: The Vulkan backend uses llama-cpp-python with GGUF models, so AMD GPUs can run inference without a ROCm installation.
### Optional Dependencies ### Optional Dependencies
#### bitsandbytes (Quantization) #### bitsandbytes (Quantization)
@@ -116,8 +130,14 @@ pip install flash-attn --no-build-isolation
### Basic Usage ### Basic Usage
```bash ```bash
# Run with a specific model # Activate the virtual environment created by build.sh
python coderai --model microsoft/DialoGPT-medium source venv/bin/activate
# Run with NVIDIA backend (HuggingFace models)
python coderai --model microsoft/DialoGPT-medium --backend nvidia
# Run with Vulkan backend (GGUF models)
python coderai --model ./phi-3-mini-4k-instruct-q4_k_m.gguf --backend vulkan
# The server will start on http://0.0.0.0:8000 by default # The server will start on http://0.0.0.0:8000 by default
``` ```
@@ -125,28 +145,68 @@ python coderai --model microsoft/DialoGPT-medium
### Command-Line Options ### Command-Line Options
``` ```
usage: coderai [-h] [--model MODEL] [--host HOST] [--port PORT] usage: coderai [-h] [--model MODEL] [--backend {auto,nvidia,vulkan}] [--host HOST]
[--offload-dir OFFLOAD_DIR] [--load-in-4bit] [--load-in-8bit] [--port PORT] [--offload-dir OFFLOAD_DIR] [--load-in-4bit]
[--ram RAM] [--flash-attn] [--load-in-8bit] [--ram RAM] [--flash-attn] [--n-gpu-layers N]
[--n-ctx N]
OpenAI-compatible API server with memory-aware model loading OpenAI-compatible API server supporting NVIDIA (CUDA) and Vulkan backends
options: options:
-h, --help show this help message and exit -h, --help show this help message and exit
--model MODEL HuggingFace model name or path --model MODEL Model name or path. For NVIDIA: HuggingFace model.
For Vulkan: GGUF file path or HF repo
--backend {auto,nvidia,vulkan}
Backend to use: auto (detect), nvidia (CUDA), or
vulkan (AMD GPUs)
--host HOST Host to bind to (default: 0.0.0.0) --host HOST Host to bind to (default: 0.0.0.0)
--port PORT Port to bind to (default: 8000) --port PORT Port to bind to (default: 8000)
--offload-dir OFFLOAD_DIR --offload-dir OFFLOAD_DIR
Directory for disk offload when model doesn't fit in Directory for disk offload (NVIDIA only, default: ./offload)
VRAM+RAM (default: ./offload) --load-in-4bit Load model in 4-bit precision (NVIDIA only, requires bitsandbytes)
--load-in-4bit Load model in 4-bit precision (requires bitsandbytes) --load-in-8bit Load model in 8-bit precision (NVIDIA only, requires bitsandbytes)
--load-in-8bit Load model in 8-bit precision (requires bitsandbytes) --ram RAM Manually specify available RAM in GB (NVIDIA only)
--ram RAM Manually specify available RAM in GB (bypasses auto- --flash-attn Use Flash Attention 2 (NVIDIA only, requires flash-attn)
detection) --n-gpu-layers N Number of layers to offload to GPU (Vulkan only,
--flash-attn Use Flash Attention 2 for faster inference (requires default: -1 = all layers)
flash-attn package and compatible GPU) --n-ctx N Context window size (Vulkan only, default: 2048)
```
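The backend-related options in the usage text could be declared with `argparse` roughly as follows. This is a minimal sketch, not the project's actual parser: the `build_parser` name is hypothetical, only the backend-related flags are shown, and the defaults are taken from the help text and the Backend Selection section.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the backend-related options."""
    parser = argparse.ArgumentParser(prog="coderai")
    parser.add_argument(
        "--model",
        help="HuggingFace model (NVIDIA) or GGUF file path / HF repo (Vulkan)",
    )
    parser.add_argument(
        "--backend",
        choices=["auto", "nvidia", "vulkan"],
        default="auto",
    )
    # -1 means "offload every layer to the GPU", per the help text.
    parser.add_argument("--n-gpu-layers", type=int, default=-1, metavar="N")
    parser.add_argument("--n-ctx", type=int, default=2048, metavar="N")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--model", "model.gguf", "--backend", "vulkan", "--n-ctx", "4096"]
    )
    print(args.backend, args.n_gpu_layers, args.n_ctx)
```

`argparse` maps `--n-gpu-layers` to the attribute `n_gpu_layers`, so the Vulkan backend would read the values as `args.n_gpu_layers` and `args.n_ctx`.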
### Backend Selection
The `--backend` option controls which backend to use:
- **`auto`** (default): Automatically detects available backends, preferring NVIDIA if available
- **`nvidia`**: Use PyTorch + Transformers with CUDA (for NVIDIA GPUs)
- **`vulkan`**: Use llama-cpp-python with Vulkan (for AMD GPUs)
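The selection rule can be sketched as a small function. This is an illustration under assumptions, not the project's code: `select_backend` is a hypothetical helper, `available` is a dict of the kind returned by `detect_available_backends()`, and raising an error for an explicitly requested but unavailable backend is an assumed behavior.

```python
from typing import Dict


def select_backend(requested: str, available: Dict[str, bool]) -> str:
    """Resolve a --backend value to a concrete backend name (hypothetical)."""
    if requested != "auto":
        # Assumption: an explicit request for a missing backend is an error.
        if not available.get(requested, False):
            raise RuntimeError(f"Backend '{requested}' is not available")
        return requested
    # 'auto': prefer NVIDIA when present, otherwise fall back to Vulkan.
    for name in ("nvidia", "vulkan"):
        if available.get(name, False):
            return name
    raise RuntimeError("No suitable backend found")


if __name__ == "__main__":
    print(select_backend("auto", {"cpu": True, "nvidia": True, "vulkan": True}))
    print(select_backend("auto", {"cpu": True, "vulkan": True}))
```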
### Model Formats by Backend
#### NVIDIA Backend
Uses HuggingFace Transformers format:
```bash
python coderai --model microsoft/DialoGPT-medium --backend nvidia
python coderai --model meta-llama/Llama-2-7b-chat-hf --backend nvidia
``` ```
#### Vulkan Backend
Uses GGUF format (can be local files or downloaded from HuggingFace):
```bash
# Local GGUF file
python coderai --model ./phi-3-mini-4k-instruct-q4_k_m.gguf --backend vulkan
# Download from HuggingFace (auto-selects GGUF file)
python coderai --model microsoft/Phi-3-mini-4k-instruct-gguf --backend vulkan
# Specific GGUF file from repo
python coderai --model TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf --backend vulkan
```
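The three `--model` forms shown above (local file, HuggingFace repo, repo plus file name) could be told apart along these lines. This is a sketch under assumptions: `parse_gguf_model_spec` is a hypothetical name, and the real backend may use different rules, in particular for picking a GGUF file when only a repo is given.

```python
import os
from typing import Optional, Tuple


def parse_gguf_model_spec(spec: str) -> Tuple[str, str, Optional[str]]:
    """Classify a --model value.

    Returns ("local", path, None) for a file on disk, or
    ("hub", repo_id, filename) where filename may be None.
    """
    parts = spec.split("/")
    looks_like_path = spec.startswith((".", "/", "~")) or os.path.isfile(spec)
    is_gguf = spec.lower().endswith(".gguf")

    # Explicit paths, or short names ending in .gguf, are treated as local.
    if looks_like_path or (is_gguf and len(parts) < 3):
        return ("local", spec, None)
    # "owner/repo": the GGUF file still has to be chosen from the repo.
    if len(parts) == 2:
        return ("hub", spec, None)
    # "owner/repo/file.gguf": repo id plus a specific file.
    if len(parts) >= 3 and is_gguf:
        return ("hub", "/".join(parts[:2]), "/".join(parts[2:]))
    raise ValueError(f"Cannot interpret model spec: {spec!r}")


if __name__ == "__main__":
    for example in (
        "./phi-3-mini-4k-instruct-q4_k_m.gguf",
        "microsoft/Phi-3-mini-4k-instruct-gguf",
        "TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf",
    ):
        print(parse_gguf_model_spec(example))
```

For the `hub` case with a file name, the download step could be handled by `huggingface_hub.hf_hub_download(repo_id=..., filename=...)`; whether the project does it that way is not shown here.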
**Finding GGUF models:**
- Search on HuggingFace: https://huggingface.co/models?search=gguf
- Popular collections: TheBloke, unsloth, bartowski
- Recommended quantization: Q4_K_M for best speed/quality balance
### Examples ### Examples
#### Run with 4-bit Quantization (Low VRAM) #### Run with 4-bit Quantization (Low VRAM)
@@ -276,41 +336,72 @@ curl -X POST http://localhost:8000/v1/chat/completions \
## Configuration for Different Setups ## Configuration for Different Setups
### CUDA (NVIDIA GPU) ### NVIDIA (CUDA)
```bash ```bash
# Install CUDA-enabled PyTorch # Using build script
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/cu121 ./build.sh nvidia
# Or manually install CUDA-enabled PyTorch
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0"
pip install -r requirements-nvidia.txt
# Run with GPU acceleration (automatic) # Run with GPU acceleration
python coderai --model meta-llama/Llama-2-7b-chat-hf python coderai --model meta-llama/Llama-2-7b-chat-hf --backend nvidia
# Optional: Enable Flash Attention 2 for faster inference # Optional: Enable Flash Attention 2 for faster inference
python coderai --model meta-llama/Llama-2-7b-chat-hf --flash-attn python coderai --model meta-llama/Llama-2-7b-chat-hf --backend nvidia --flash-attn
``` ```
### ROCm (AMD GPU) ### AMD (Vulkan)
```bash ```bash
# Install ROCm-enabled PyTorch (use 6.0 for newer GPUs, 5.6 for older) # Install Vulkan drivers first
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/rocm6.0 # Debian/Ubuntu:
sudo apt install libvulkan-dev vulkan-tools mesa-vulkan-drivers
# Using build script
./build.sh vulkan
# Run with GGUF model
python coderai --model ./phi-3-mini-4k-instruct-q4_k_m.gguf --backend vulkan
# Run with GPU acceleration (automatic) # Or download automatically from HuggingFace
python coderai --model meta-llama/Llama-2-7b-chat-hf python coderai --model TheBloke/Llama-2-7B-GGUF --backend vulkan
# Check ROCm detection in output # Control GPU layer offloading (default: -1 = all layers)
python coderai --model model.gguf --backend vulkan --n-gpu-layers 35
# Adjust context window (default: 2048)
python coderai --model model.gguf --backend vulkan --n-ctx 4096
``` ```
**Vulkan Backend Notes:**
- Uses GGUF format models (quantized, so typically much smaller than full-precision HuggingFace checkpoints)
- Q4_K_M quantization recommended for 4GB+ VRAM GPUs
- Q5_K_M or Q6_K for higher quality
- Works on AMD RX 400 series and newer
- Also works on NVIDIA GPUs but CUDA backend is preferred for NVIDIA
### CPU-Only ### CPU-Only
```bash While not recommended for performance, you can run on CPU:
# Install CPU-only PyTorch
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/cpu
# Run on CPU (automatic fallback) ```bash
python coderai --model microsoft/DialoGPT-medium # NVIDIA backend on CPU
pip install "torch>=2.0.0" --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements-nvidia.txt
python coderai --model microsoft/DialoGPT-medium --backend nvidia
# Or Vulkan backend on CPU (llama-cpp supports CPU fallback)
CMAKE_ARGS="-DGGML_VULKAN=OFF" pip install llama-cpp-python
python coderai --model model.gguf --backend vulkan
``` ```
### ROCm Alternative (deprecated)
While the Vulkan backend is now recommended for AMD GPUs, ROCm support is still available through the NVIDIA backend if you have ROCm-enabled PyTorch installed.
### Low VRAM Configuration ### Low VRAM Configuration
For GPUs with limited VRAM (4-8GB): For GPUs with limited VRAM (4-8GB):
@@ -340,24 +431,59 @@ python coderai --model meta-llama/Llama-2-70b-chat-hf --load-in-8bit
## Model Recommendations ## Model Recommendations
### Small Models (For Testing) ### NVIDIA Backend (HuggingFace Models)
#### Small Models (For Testing)
- `microsoft/DialoGPT-medium` (~345M parameters) - `microsoft/DialoGPT-medium` (~345M parameters)
- `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (~1.1B parameters) - `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (~1.1B parameters)
- `facebook/blenderbot-400M-distill` (~400M parameters) - `facebook/blenderbot-400M-distill` (~400M parameters)
### Medium Models (4-8GB VRAM with 4-bit) #### Medium Models (4-8GB VRAM with 4-bit)
- `meta-llama/Llama-2-7b-chat-hf` (~7B parameters) - `meta-llama/Llama-2-7b-chat-hf` (~7B parameters)
- `mistralai/Mistral-7B-Instruct-v0.2` (~7B parameters) - `mistralai/Mistral-7B-Instruct-v0.2` (~7B parameters)
- `HuggingFaceH4/zephyr-7b-beta` (~7B parameters) - `HuggingFaceH4/zephyr-7b-beta` (~7B parameters)
### Large Models (Multiple GPUs or High VRAM) #### Large Models (Multiple GPUs or High VRAM)
- `meta-llama/Llama-2-13b-chat-hf` (~13B parameters) - `meta-llama/Llama-2-13b-chat-hf` (~13B parameters)
- `meta-llama/Llama-2-70b-chat-hf` (~70B parameters) - requires multiple GPUs or disk offload - `meta-llama/Llama-2-70b-chat-hf` (~70B parameters) - requires multiple GPUs or disk offload
- `bigscience/bloom-7b1` (~7B parameters) - `bigscience/bloom-7b1` (~7B parameters)
### Vulkan Backend (GGUF Models)
#### Small Models (2-4GB VRAM)
- `TheBloke/phi-2-GGUF` - phi-2.Q4_K_M.gguf (~2.7B parameters, ~1.8GB file)
- `TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF` - tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
#### Medium Models (4-8GB VRAM)
- `TheBloke/Llama-2-7B-GGUF` - llama-2-7b.Q4_K_M.gguf (~4GB file)
- `TheBloke/Mistral-7B-Instruct-v0.2-GGUF` - mistral-7b-instruct-v0.2.Q4_K_M.gguf
- `microsoft/Phi-3-mini-4k-instruct-gguf` - Phi-3-mini-4k-instruct-q4.gguf
#### Large Models (8GB+ VRAM)
- `TheBloke/Llama-2-13B-GGUF` - llama-2-13b.Q4_K_M.gguf (~7.5GB file)
- `TheBloke/deepseek-coder-6.7B-base-GGUF` - deepseek-coder-6.7b-base.Q4_K_M.gguf
**GGUF Quantization Guide:**
- `Q4_K_M` - Best balance of speed/quality (recommended)
- `Q5_K_M` - Higher quality, slightly slower
- `Q6_K` - Near-unquantized quality
- `Q8_0` - Maximum quality, largest size
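To choose a quantization for a given amount of VRAM, a rough file-size estimate from the parameter count is often enough. The sketch below is illustrative only: the bits-per-weight values are approximate, assumed figures, real files vary by model architecture, and the context buffer needs additional memory on top of the file size.

```python
# Approximate bits per weight for the quantization types listed above.
# These are assumed, rounded figures for illustration; real files differ.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}


def estimate_gguf_size_gb(num_params: float, quant: str) -> float:
    """Estimate GGUF file size in GB (1e9 bytes) from a parameter count."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    return num_params * bits / 8 / 1e9


if __name__ == "__main__":
    for quant in APPROX_BITS_PER_WEIGHT:
        size = estimate_gguf_size_gb(7e9, quant)
        print(f"7B parameters @ {quant}: ~{size:.1f} GB")
```

For a 7B model this puts `Q4_K_M` at a little over 4 GB, which is in line with the ~4GB file size listed for `llama-2-7b.Q4_K_M.gguf`.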
**Download Example:**
```bash
# Using huggingface-cli
huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_K_M.gguf --local-dir ./models
# Or let coderai download automatically
python coderai --model TheBloke/Llama-2-7B-GGUF --backend vulkan
```
## Troubleshooting ## Troubleshooting
### Shell Redirection Error: "No such file or directory: '0.0'" ### Shell Redirection Error: "No such file or directory: '0.0'"
@@ -473,6 +599,94 @@ python coderai --model meta-llama/Llama-2-70b-chat-hf --load-in-8bit
2. Check Python version: `python --version` (should be 3.8+) 2. Check Python version: `python --version` (should be 3.8+)
3. Verify virtual environment is activated 3. Verify virtual environment is activated
### Vulkan-Specific Issues
**Problem**: "Vulkan backend not available" or llama-cpp fails to load
**Solutions**:
1. **Verify Vulkan drivers are installed:**
```bash
# Check Vulkan installation
vulkaninfo | grep "deviceName"
# Or install if missing
# Debian/Ubuntu:
sudo apt install libvulkan-dev vulkan-tools mesa-vulkan-drivers
# Fedora:
sudo dnf install vulkan-loader-devel vulkan-tools mesa-vulkan-drivers
```
2. **Reinstall llama-cpp-python with Vulkan:**
```bash
pip uninstall llama-cpp-python -y
CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python --no-cache-dir
```
3. **Check GPU compatibility:**
- AMD RX 400 series and newer
- NVIDIA GTX 900 series and newer (but CUDA backend preferred for NVIDIA)
- Intel Arc GPUs (experimental)
**Problem**: GGUF model fails to load or produces garbled output
**Solutions**:
1. **Verify model format**: Must be GGUF format, not regular HuggingFace format
```bash
# Check file extension
ls -la model.gguf # Should end in .gguf
```
2. **Try different quantization**: Some GGUF files may be incompatible
- Q4_K_M is most compatible (recommended)
- Q5_K_M or Q6_K for higher quality
- Avoid IQ quants if having issues
3. **Check model architecture**: Some very new models may need updated llama-cpp
```bash
pip install --upgrade llama-cpp-python
```
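A file extension alone does not prove the format. GGUF files begin with the four ASCII bytes `GGUF`, so a stricter check can read the header. The helper name `looks_like_gguf` is hypothetical; this is a quick sanity check, not a full validation of the file.

```python
def looks_like_gguf(path: str) -> bool:
    """Return True if the file starts with the 4-byte GGUF magic."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"


if __name__ == "__main__":
    import sys

    for model_path in sys.argv[1:]:
        status = "GGUF" if looks_like_gguf(model_path) else "not GGUF"
        print(f"{model_path}: {status}")
```

A file that fails this check (for example a renamed safetensors file or an interrupted download that saved an HTML error page) will not load in llama-cpp-python, whatever its extension.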
**Problem**: Vulkan backend runs on CPU instead of GPU
**Solutions**:
1. **Check layer offloading**: Verify layers are being offloaded
```bash
# Check GPU layers parameter (default -1 = all layers)
python coderai --model model.gguf --backend vulkan --n-gpu-layers 35
```
2. **Check verbose output**: Look for Vulkan device initialization in logs
```bash
# Filter the startup output for Vulkan-related lines
python coderai --model model.gguf --backend vulkan 2>&1 | grep -i vulkan
```
3. **Verify GPU visibility**: Check that Vulkan sees your GPU
```bash
vulkaninfo | grep -A 5 "GPU0\|GPU1"
```
### Backend Not Detected
**Problem**: "No suitable backend found" error
**Solutions**:
1. **Check which backends are available:**
```bash
python -c "import coderai; print(coderai.detect_available_backends())"
```
2. **For NVIDIA**: Ensure PyTorch with CUDA is installed
```bash
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
```
3. **For Vulkan**: Ensure llama-cpp-python is installed with Vulkan support
```bash
python -c "from llama_cpp import Llama; print('llama-cpp available')"
```
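The individual checks above can be combined into one standalone script that does not depend on the `coderai` module being importable. This is a sketch with hypothetical helper names. It only reports whether each package is installed: it does not confirm that CUDA is usable or that llama-cpp-python was compiled with Vulkan enabled.

```python
import importlib.util
from typing import Dict


def module_available(name: str) -> bool:
    """True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


def report_backend_modules() -> Dict[str, bool]:
    """Report which backend-related packages are installed."""
    return {
        "torch": module_available("torch"),          # NVIDIA backend
        "llama_cpp": module_available("llama_cpp"),  # Vulkan backend
        "flash_attn": module_available("flash_attn"),  # optional, NVIDIA only
    }


if __name__ == "__main__":
    for module, present in report_backend_modules().items():
        print(f"{module}: {'installed' if present else 'missing'}")
```

If `llama_cpp` is reported as installed but inference still runs on the CPU, the package was most likely built without `-DGGML_VULKAN=ON` and needs the reinstall step from the Vulkan-Specific Issues section.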
## License ## License
This project is licensed under the GNU General Public License v3.0 - see the [LICENSE.md](LICENSE.md) file for details. This project is licensed under the GNU General Public License v3.0 - see the [LICENSE.md](LICENSE.md) file for details.
@@ -484,5 +698,10 @@ Contributions are welcome! Please feel free to submit a merge request.
## Acknowledgments ## Acknowledgments
- Built with [FastAPI](https://fastapi.tiangolo.com/) - Built with [FastAPI](https://fastapi.tiangolo.com/)
- Powered by [HuggingFace Transformers](https://huggingface.co/docs/transformers/) - Powered by [HuggingFace Transformers](https://huggingface.co/docs/transformers/) (NVIDIA backend)
- Powered by [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) with Vulkan support (AMD backend)
- Inspired by the OpenAI API specification - Inspired by the OpenAI API specification
---
**Note on AI.PROMPT**: This project was enhanced following instructions to add Vulkan support for AMD GPUs alongside the existing NVIDIA/CUDA support. The implementation uses llama-cpp-python for Vulkan/GGUF model support while maintaining full compatibility with the existing HuggingFace/Transformers backend for NVIDIA GPUs.
#!/bin/bash
# Build script for CoderAI - Supports NVIDIA (CUDA) and Vulkan (AMD GPUs) backends
# Usage: ./build.sh [nvidia|vulkan]
# Default: nvidia
set -e
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Determine backend
BACKEND="${1:-nvidia}"
BACKEND=$(echo "$BACKEND" | tr '[:upper:]' '[:lower:]')
if [[ "$BACKEND" != "nvidia" && "$BACKEND" != "vulkan" ]]; then
echo -e "${RED}Error: Invalid backend '$BACKEND'${NC}"
echo "Usage: ./build.sh [nvidia|vulkan]"
echo " nvidia - Use PyTorch with CUDA for NVIDIA GPUs"
echo " vulkan - Use llama-cpp-python with Vulkan for AMD GPUs"
exit 1
fi
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE} CoderAI Build Script${NC}"
echo -e "${BLUE} Backend: ${GREEN}$BACKEND${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""
# Check Python version
PYTHON_VERSION=$(python3 --version 2>&1 | grep -oP '\d+\.\d+' | head -1)
REQUIRED_VERSION="3.8"
if [ "$(printf '%s\n' "$REQUIRED_VERSION" "$PYTHON_VERSION" | sort -V | head -n1)" != "$REQUIRED_VERSION" ]; then
echo -e "${RED}Error: Python 3.8+ required, found $PYTHON_VERSION${NC}"
exit 1
fi
echo -e "${GREEN}✓ Python version: $PYTHON_VERSION${NC}"
# Create virtual environment if it doesn't exist
VENV_DIR="venv"
if [ ! -d "$VENV_DIR" ]; then
echo -e "${YELLOW}Creating virtual environment...${NC}"
python3 -m venv "$VENV_DIR"
fi
# Activate virtual environment
echo -e "${YELLOW}Activating virtual environment...${NC}"
source "$VENV_DIR/bin/activate"
# Upgrade pip
echo -e "${YELLOW}Upgrading pip...${NC}"
pip install --upgrade pip
echo ""
echo -e "${BLUE}Installing dependencies for $BACKEND backend...${NC}"
echo ""
if [ "$BACKEND" = "nvidia" ]; then
# NVIDIA/CUDA backend
echo -e "${YELLOW}Installing PyTorch with CUDA support...${NC}"
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0"
echo -e "${YELLOW}Installing NVIDIA-specific requirements...${NC}"
pip install -r requirements-nvidia.txt
echo ""
echo -e "${GREEN}========================================${NC}"
echo -e "${GREEN} NVIDIA/CUDA build complete!${NC}"
echo -e "${GREEN}========================================${NC}"
echo ""
echo "Usage:"
echo " source venv/bin/activate"
echo " python coderai --model <huggingface-model-name>"
echo ""
echo "Example:"
echo " python coderai --model microsoft/DialoGPT-medium"
echo ""
elif [ "$BACKEND" = "vulkan" ]; then
# Vulkan backend
echo -e "${YELLOW}Installing llama-cpp-python with Vulkan support...${NC}"
# Check for required Vulkan development libraries
if ! pkg-config --exists vulkan 2>/dev/null; then
echo -e "${YELLOW}Warning: Vulkan development libraries not found via pkg-config${NC}"
echo -e "${YELLOW}You may need to install Vulkan drivers and SDK:${NC}"
echo " Debian/Ubuntu: sudo apt install libvulkan-dev vulkan-tools"
echo " Fedora: sudo dnf install vulkan-loader-devel vulkan-tools"
echo " Arch: sudo pacman -S vulkan-headers vulkan-icd-loader"
echo ""
echo -e "${YELLOW}Attempting installation anyway...${NC}"
fi
# Install llama-cpp-python with Vulkan support
# CMAKE_ARGS is used to enable Vulkan during compilation
CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python --no-cache-dir
echo -e "${YELLOW}Installing Vulkan-specific requirements...${NC}"
pip install -r requirements-vulkan.txt
echo ""
echo -e "${GREEN}========================================${NC}"
echo -e "${GREEN} Vulkan build complete!${NC}"
echo -e "${GREEN}========================================${NC}"
echo ""
echo "Usage:"
echo " source venv/bin/activate"
echo " python coderai --model <path-to-gguf-model> --backend vulkan"
echo ""
echo "Example:"
echo " python coderai --model ./phi-3-mini-4k-instruct-q4_k_m.gguf --backend vulkan"
echo ""
echo "Note: For Vulkan, you need to use GGUF format models."
echo " Download from: https://huggingface.co/models?search=gguf"
echo ""
fi
# Create .backend file to track which backend was used
echo "$BACKEND" > .backend
echo -e "${GREEN}Build completed successfully!${NC}"
echo ""
echo "To activate the environment in the future, run:"
echo " source venv/bin/activate"
#!/usr/bin/env python3 #!/usr/bin/env python3
""" """
OpenAI-compatible API server for HuggingFace models. OpenAI-compatible API server for HuggingFace models (NVIDIA) and GGUF models (Vulkan).
Supports CUDA, ROCm GPU auto-detection, memory-aware model loading, Supports CUDA (NVIDIA) and Vulkan (AMD) GPU backends, memory-aware model loading,
sequential offload (VRAM -> RAM -> Disk), streaming, and tool calling. streaming, and tool calling.
""" """
import argparse import argparse
@@ -14,228 +14,54 @@ import sys
import time import time
import uuid import uuid
import warnings import warnings
from abc import ABC, abstractmethod
from contextlib import asynccontextmanager from contextlib import asynccontextmanager
from typing import AsyncGenerator, Dict, List, Optional, Union from typing import AsyncGenerator, Dict, List, Optional, Union
import psutil import psutil
import torch
from fastapi import FastAPI, HTTPException, Request from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import StreamingResponse from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field from pydantic import BaseModel, Field
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
AutoConfig,
TextIteratorStreamer,
StoppingCriteria,
StoppingCriteriaList,
LogitsProcessor,
LogitsProcessorList,
)
from threading import Thread from threading import Thread
# ============================================================================= # =============================================================================
# Flash Attention Detection # Backend Detection and Imports
# ============================================================================= # =============================================================================
def check_flash_attn_availability() -> bool: def detect_available_backends():
"""Check if flash-attn is installed and available.""" """Detect which backends are available."""
backends = {'cpu': True}
# Check for PyTorch/CUDA
try: try:
import flash_attn import torch
return True if torch.cuda.is_available():
backends['nvidia'] = True
except ImportError: except ImportError:
return False pass
# =============================================================================
# Logits Processor for Numerical Stability
# =============================================================================
class InvalidLogitsProcessor(LogitsProcessor):
"""Replace NaN and Inf values in logits with finite values."""
def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor: # Check for llama-cpp-python (Vulkan)
"""Replace invalid values in logits.""" try:
# Replace NaN with very negative number (near -inf but finite) import llama_cpp
scores = torch.where(torch.isnan(scores), torch.tensor(-1e9, dtype=scores.dtype, device=scores.device), scores) backends['vulkan'] = True
# Replace Inf with large finite number except ImportError:
scores = torch.where(torch.isinf(scores), torch.tensor(1e9, dtype=scores.dtype, device=scores.device), scores) pass
# Replace -Inf with very negative finite number
scores = torch.where(scores < -1e9, torch.tensor(-1e9, dtype=scores.dtype, device=scores.device), scores) return backends
return scores
# ============================================================================= # =============================================================================
# Memory Detection and Model Sizing # Flash Attention Detection (for NVIDIA backend)
# ============================================================================= # =============================================================================
def get_available_vram() -> int: def check_flash_attn_availability() -> bool:
"""Get available VRAM in bytes. Returns 0 if no GPU available.""" """Check if flash-attn is installed and available."""
if not torch.cuda.is_available():
return 0
try:
total_vram = 0
for i in range(torch.cuda.device_count()):
props = torch.cuda.get_device_properties(i)
total_vram += props.total_memory
return total_vram
except Exception as e:
print(f"Warning: Could not detect VRAM: {e}")
return 0
def get_available_ram(manual_ram_gb: Optional[float] = None) -> int:
"""
Get available system RAM in bytes.
Args:
manual_ram_gb: If specified, use this value in GB instead of auto-detection
Returns:
Available RAM in bytes
"""
if manual_ram_gb is not None:
ram_bytes = int(manual_ram_gb * 1e9)
print(f"Using manually specified RAM: {manual_ram_gb} GB ({ram_bytes / 1e9:.2f} GB)")
return ram_bytes
try:
mem = psutil.virtual_memory()
print(f"Auto-detected RAM: {mem.available / 1e9:.2f} GB available")
return mem.available
except Exception as e:
print(f"Warning: Could not detect RAM: {e}")
return 0
def estimate_model_size_from_config(model_name: str) -> Optional[int]:
"""
Estimate model size in bytes from config.
Returns None if config cannot be loaded.
"""
try: try:
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True) import flash_attn
return True
# Get model parameters from config except ImportError:
if hasattr(config, 'num_parameters'): return False
num_params = config.num_parameters
elif hasattr(config, 'n_params'):
num_params = config.n_params
elif hasattr(config, 'num_hidden_layers') and hasattr(config, 'hidden_size'):
# Estimate based on transformer architecture
# Rough estimate: ~12 * num_layers * hidden_size^2 for standard transformers
layers = config.num_hidden_layers
hidden = config.hidden_size
vocab_size = getattr(config, 'vocab_size', 50000)
# Rough parameter count estimation
# Embedding: vocab_size * hidden
# Each layer: ~12 * hidden^2 (about 4 for attention + 8 for the FFN)
num_params = (vocab_size * hidden) + (layers * 12 * hidden * hidden)
else:
return None
# Assume float16 (2 bytes per parameter) for GPU loading
# This is the typical loading format
return num_params * 2
except Exception as e:
print(f"Warning: Could not estimate model size: {e}")
return None
def calculate_safety_margin(memory_bytes: int) -> int:
"""Apply safety margin to available memory (leave 10% headroom)."""
return int(memory_bytes * 0.9)
def determine_offload_strategy(
model_name: str,
available_vram: int,
available_ram: int,
quantization_bits: Optional[int] = None
) -> Dict[str, any]:
"""
Determine the best offload strategy based on available memory.
Returns a dict with:
- device_map: str or dict for model loading
- offload_folder: Optional[str] for disk offload
- load_in_8bit: bool
- load_in_4bit: bool
- max_memory: Optional[dict]
"""
# Estimate model size
estimated_size = estimate_model_size_from_config(model_name)
if estimated_size is None:
print("Could not estimate model size, using auto device_map")
return {
'device_map': 'auto',
'offload_folder': None,
'load_in_8bit': False,
'load_in_4bit': False,
'max_memory': None,
}
# Apply quantization factor if specified
if quantization_bits == 4:
estimated_size = estimated_size // 4 # 4-bit = 0.5 bytes per param
elif quantization_bits == 8:
estimated_size = estimated_size // 2 # 8-bit = 1 byte per param
# Add overhead for activations and the KV cache at inference time (roughly 20%)
required_memory = int(estimated_size * 1.2)
print(f"Estimated model size: {estimated_size / 1e9:.2f} GB")
print(f"Required memory (with overhead): {required_memory / 1e9:.2f} GB")
print(f"Available VRAM: {available_vram / 1e9:.2f} GB")
print(f"Available RAM: {available_ram / 1e9:.2f} GB")
safe_vram = calculate_safety_margin(available_vram)
safe_ram = calculate_safety_margin(available_ram)
strategy = {
'device_map': None,
'offload_folder': None,
'load_in_8bit': False,
'load_in_4bit': False,
'max_memory': None,
}
# Case 1: Model fits entirely in VRAM
if required_memory <= safe_vram:
print("Strategy: Loading fully to GPU")
strategy['device_map'] = 'cuda'
if torch.cuda.device_count() > 1:
strategy['device_map'] = 'auto'
# Case 2: Model fits in VRAM + RAM combined
elif required_memory <= (safe_vram + safe_ram):
print("Strategy: Using device_map='auto' for VRAM + RAM offload")
strategy['device_map'] = 'auto'
# Set max_memory to help accelerate distribute layers
if torch.cuda.is_available():
max_memory = {}
for i in range(torch.cuda.device_count()):
max_memory[i] = safe_vram // torch.cuda.device_count()
max_memory['cpu'] = safe_ram
strategy['max_memory'] = max_memory
# Case 3: Need disk offload
else:
print("Strategy: VRAM + RAM + Disk offload required")
strategy['device_map'] = 'auto'
if torch.cuda.is_available():
max_memory = {}
for i in range(torch.cuda.device_count()):
max_memory[i] = safe_vram // torch.cuda.device_count()
max_memory['cpu'] = safe_ram
strategy['max_memory'] = max_memory
# offload_folder will be set from command line argument
return strategy
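The three cases in `determine_offload_strategy` reduce to a comparison of one number against two thresholds. A minimal sketch of that decision as a pure function follows; `pick_offload_tier` and its tier labels are hypothetical names used only for illustration, and the 0.9 headroom and 1.2 overhead factors are taken from the code above.

```python
from typing import Optional


def pick_offload_tier(
    model_bytes: int,
    vram_bytes: int,
    ram_bytes: int,
    quantization_bits: Optional[int] = None,
    headroom: float = 0.9,   # keep 10% of each memory pool free
    overhead: float = 1.2,   # ~20% extra on top of the weights
) -> str:
    """Return 'gpu', 'gpu+cpu', or 'gpu+cpu+disk' for an fp16 size estimate."""
    # The estimate assumes fp16 (2 bytes per parameter), so 8-bit halves it
    # and 4-bit quarters it.
    if quantization_bits == 4:
        model_bytes //= 4
    elif quantization_bits == 8:
        model_bytes //= 2

    required = int(model_bytes * overhead)
    safe_vram = int(vram_bytes * headroom)
    safe_ram = int(ram_bytes * headroom)

    if required <= safe_vram:
        return "gpu"            # Case 1: fits entirely in VRAM
    if required <= safe_vram + safe_ram:
        return "gpu+cpu"        # Case 2: spill the remainder to system RAM
    return "gpu+cpu+disk"       # Case 3: disk offload needed


# A 7B-parameter model in fp16 is roughly 14 GB of weights.
seven_b = 14_000_000_000
print(pick_offload_tier(seven_b, 24_000_000_000, 32_000_000_000))
print(pick_offload_tier(seven_b, 8_000_000_000, 32_000_000_000))
print(pick_offload_tier(seven_b, 8_000_000_000, 4_000_000_000))
```

Keeping the decision free of `torch` calls makes it straightforward to unit-test without a GPU; the real function then only has to translate the tier into `device_map` and `max_memory` values.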
# ============================================================================= # =============================================================================
...@@ -300,13 +126,13 @@ class ModelList(BaseModel): ...@@ -300,13 +126,13 @@ class ModelList(BaseModel):
# ============================================================================= # =============================================================================
# Tool Parsing and Function Calling # Tool Parsing
# ============================================================================= # =============================================================================
class ToolCallParser: class ToolCallParser:
"""Parse model outputs to extract tool calls.""" """Parse model outputs to extract tool calls."""
def __init__(self, tokenizer): def __init__(self, tokenizer=None):
self.tokenizer = tokenizer self.tokenizer = tokenizer
def extract_tool_calls(self, text: str, available_tools: List[Tool]) -> Optional[List[Dict]]: def extract_tool_calls(self, text: str, available_tools: List[Tool]) -> Optional[List[Dict]]:
...@@ -421,19 +247,59 @@ def format_tools_for_prompt(tools: List[Tool], messages: List[ChatMessage]) -> L ...@@ -421,19 +247,59 @@ def format_tools_for_prompt(tools: List[Tool], messages: List[ChatMessage]) -> L
# ============================================================================= # =============================================================================
# Model Management # Abstract Model Backend
# ============================================================================= # =============================================================================
class ModelManager: class ModelBackend(ABC):
"""Manages the loaded model and tokenizer.""" """Abstract base class for model backends."""
@abstractmethod
def load_model(self, model_name: str, **kwargs) -> None:
"""Load the model."""
pass
@abstractmethod
def generate(self, prompt: str, max_tokens: Optional[int] = None,
temperature: float = 0.7, top_p: float = 1.0,
stop: Optional[List[str]] = None) -> str:
"""Generate text non-streaming."""
pass
@abstractmethod
def generate_stream(self, prompt: str, max_tokens: Optional[int] = None,
temperature: float = 0.7, top_p: float = 1.0,
stop: Optional[List[str]] = None) -> AsyncGenerator[str, None]:
"""Generate text in streaming fashion."""
pass
@abstractmethod
def format_messages(self, messages: List[ChatMessage]) -> str:
"""Format messages into a prompt string."""
pass
@abstractmethod
def get_model_name(self) -> str:
"""Return the loaded model name."""
pass
@abstractmethod
def cleanup(self) -> None:
"""Cleanup resources."""
pass
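One detail of the `ModelBackend` interface is easy to miss: `generate_stream` is declared as a plain abstract method, but both concrete backends implement it with `async def` plus `yield`, so calling it returns an async generator that must be consumed with `async for`. The sketch below illustrates the pattern with a trimmed-down interface; `Backend`, `EchoBackend`, and `collect` are hypothetical names and are not part of this change.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import AsyncGenerator, List, Optional


class Backend(ABC):
    """Trimmed-down stand-in for the ModelBackend interface."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: Optional[int] = None) -> str:
        """Return the full completion at once."""

    @abstractmethod
    def generate_stream(
        self, prompt: str, max_tokens: Optional[int] = None
    ) -> AsyncGenerator[str, None]:
        """Return an async generator that yields completion chunks."""


class EchoBackend(Backend):
    """Toy backend that echoes the prompt back word by word."""

    def generate(self, prompt: str, max_tokens: Optional[int] = None) -> str:
        # Slicing with None keeps every word.
        return " ".join(prompt.split()[:max_tokens])

    async def generate_stream(self, prompt: str, max_tokens: Optional[int] = None):
        for word in prompt.split()[:max_tokens]:
            await asyncio.sleep(0)  # hand control back to the event loop
            yield word


async def collect(backend: Backend, prompt: str) -> List[str]:
    # Calling generate_stream() does not need `await`; iterating it does.
    return [chunk async for chunk in backend.generate_stream(prompt)]


print(EchoBackend().generate("hello streaming world", max_tokens=2))
print(asyncio.run(collect(EchoBackend(), "hello streaming world")))
```

This is why callers that previously used a plain `for` loop over `generate_stream` have to switch to `async for`.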
# =============================================================================
# NVIDIA/HuggingFace Backend
# =============================================================================
class NvidiaBackend(ModelBackend):
"""Backend for NVIDIA GPUs using HuggingFace Transformers."""
def __init__(self): def __init__(self):
self.model = None self.model = None
self.tokenizer = None self.tokenizer = None
self.model_name = None self.model_name = None
self.device = None self.device = None
self.tool_parser = None
self.offload_folder = None
self.use_flash_attn = False self.use_flash_attn = False
self.flash_attn_available = False self.flash_attn_available = False
...@@ -449,8 +315,9 @@ class ModelManager: ...@@ -449,8 +315,9 @@ class ModelManager:
print("Falling back to standard attention") print("Falling back to standard attention")
self.use_flash_attn = False self.use_flash_attn = False
def detect_device(self) -> str: def _detect_device(self) -> str:
"""Auto-detect available GPU or fall back to CPU.""" """Auto-detect available GPU or fall back to CPU."""
import torch
if torch.cuda.is_available(): if torch.cuda.is_available():
# Check for ROCm (HIP) # Check for ROCm (HIP)
if hasattr(torch.version, 'hip') and torch.version.hip is not None: if hasattr(torch.version, 'hip') and torch.version.hip is not None:
...@@ -463,71 +330,64 @@ class ModelManager: ...@@ -463,71 +330,64 @@ class ModelManager:
print("No GPU detected, using CPU") print("No GPU detected, using CPU")
return "cpu" return "cpu"
def load_model( def _get_available_vram(self) -> int:
self, """Get available VRAM in bytes. Returns 0 if no GPU available."""
model_name: str, import torch
offload_dir: Optional[str] = None, if not torch.cuda.is_available():
load_in_4bit: bool = False, return 0
load_in_8bit: bool = False,
manual_ram_gb: Optional[float] = None,
flash_attn: bool = False,
):
"""
Load the model and tokenizer from HuggingFace with memory-aware offload.
Args:
model_name: HuggingFace model name or path
offload_dir: Directory for disk offload when model doesn't fit in VRAM+RAM
load_in_4bit: Use 4-bit quantization (requires bitsandbytes)
load_in_8bit: Use 8-bit quantization (requires bitsandbytes)
manual_ram_gb: Manually specify available RAM in GB (bypasses auto-detection)
flash_attn: Use Flash Attention 2 if available (requires flash-attn package)
"""
print(f"Loading model: {model_name}")
self.use_flash_attn = flash_attn
self.check_flash_attn_support()
self.device = self.detect_device()
self.offload_folder = offload_dir
# Create offload directory if needed try:
if offload_dir: total_vram = 0
os.makedirs(offload_dir, exist_ok=True) for i in range(torch.cuda.device_count()):
print(f"Disk offload directory: {offload_dir}") props = torch.cuda.get_device_properties(i)
total_vram += props.total_memory
# Detect available memory return total_vram
available_vram = get_available_vram() except Exception as e:
available_ram = get_available_ram(manual_ram_gb) print(f"Warning: Could not detect VRAM: {e}")
return 0
def _estimate_model_size(self, model_name: str) -> Optional[int]:
"""Estimate model size in bytes from config."""
from transformers import AutoConfig
try:
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
# Get model parameters from config
if hasattr(config, 'num_parameters'):
num_params = config.num_parameters
elif hasattr(config, 'n_params'):
num_params = config.n_params
elif hasattr(config, 'num_hidden_layers') and hasattr(config, 'hidden_size'):
layers = config.num_hidden_layers
hidden = config.hidden_size
vocab_size = getattr(config, 'vocab_size', 50000)
num_params = (vocab_size * hidden) + (layers * 4 * hidden * hidden)
else:
return None
# Assume float16 (2 bytes per parameter)
return num_params * 2
except Exception as e:
print(f"Warning: Could not estimate model size: {e}")
return None
def load_model(self, model_name: str, **kwargs) -> None:
"""Load the model using HuggingFace Transformers."""
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
print(f"\nMemory Detection:") offload_dir = kwargs.get('offload_dir')
print(f" Available VRAM: {available_vram / 1e9:.2f} GB") load_in_4bit = kwargs.get('load_in_4bit', False)
print(f" Available RAM: {available_ram / 1e9:.2f} GB") load_in_8bit = kwargs.get('load_in_8bit', False)
manual_ram_gb = kwargs.get('manual_ram_gb')
flash_attn = kwargs.get('flash_attn', False)
# Determine quantization bits print(f"Loading HuggingFace model: {model_name}")
quantization_bits = None
if load_in_4bit:
quantization_bits = 4
elif load_in_8bit:
quantization_bits = 8
# Determine offload strategy self.use_flash_attn = flash_attn
strategy = determine_offload_strategy( self.check_flash_attn_support()
model_name,
available_vram,
available_ram,
quantization_bits
)
# Set offload folder if determined necessary self.device = self._detect_device()
if strategy.get('offload_folder') is None and offload_dir:
estimated_size = estimate_model_size_from_config(model_name)
safe_vram = calculate_safety_margin(available_vram)
safe_ram = calculate_safety_margin(available_ram)
if estimated_size and estimated_size > (safe_vram + safe_ram):
strategy['offload_folder'] = offload_dir
print(f"Model will use disk offload at: {offload_dir}")
# Load tokenizer # Load tokenizer
self.tokenizer = AutoTokenizer.from_pretrained( self.tokenizer = AutoTokenizer.from_pretrained(
...@@ -541,70 +401,48 @@ class ModelManager: ...@@ -541,70 +401,48 @@ class ModelManager:
self.tokenizer.pad_token = self.tokenizer.eos_token self.tokenizer.pad_token = self.tokenizer.eos_token
# Prepare model loading arguments # Prepare model loading arguments
load_kwargs = { load_kwargs = {'trust_remote_code': True}
'trust_remote_code': True,
}
# Set dtype based on device and quantization
if load_in_4bit or load_in_8bit: if load_in_4bit or load_in_8bit:
# Check if bitsandbytes is available
try: try:
import bitsandbytes as bnb import bitsandbytes as bnb
print(f"Using {4 if load_in_4bit else 8}-bit quantization") print(f"Using {4 if load_in_4bit else 8}-bit quantization")
load_kwargs['load_in_4bit'] = load_in_4bit load_kwargs['load_in_4bit'] = load_in_4bit
load_kwargs['load_in_8bit'] = load_in_8bit load_kwargs['load_in_8bit'] = load_in_8bit
load_kwargs['device_map'] = strategy['device_map'] or 'auto' load_kwargs['device_map'] = 'auto'
except ImportError: except ImportError:
print("Warning: bitsandbytes not installed. Quantization disabled.") print("Warning: bitsandbytes not installed. Quantization disabled.")
print("Install with: pip install bitsandbytes")
if self.device == "cuda": if self.device == "cuda":
load_kwargs['torch_dtype'] = torch.float16 load_kwargs['torch_dtype'] = torch.float16
else: else:
load_kwargs['torch_dtype'] = torch.float32 load_kwargs['torch_dtype'] = torch.float32
load_kwargs['device_map'] = strategy['device_map'] or ('auto' if self.device == 'cuda' else None) load_kwargs['device_map'] = 'auto' if self.device == 'cuda' else None
else: else:
if self.device == "cuda": if self.device == "cuda":
load_kwargs['torch_dtype'] = torch.float16 load_kwargs['torch_dtype'] = torch.float16
else: else:
load_kwargs['torch_dtype'] = torch.float32 load_kwargs['torch_dtype'] = torch.float32
load_kwargs['device_map'] = strategy['device_map'] or ('auto' if self.device == 'cuda' else None) load_kwargs['device_map'] = 'auto' if self.device == 'cuda' else None
# Add max_memory if specified
if strategy.get('max_memory'):
load_kwargs['max_memory'] = strategy['max_memory']
# Add offload_folder if specified # Add offload folder if specified
if strategy.get('offload_folder'): if offload_dir:
load_kwargs['offload_folder'] = strategy['offload_folder'] os.makedirs(offload_dir, exist_ok=True)
load_kwargs['offload_folder'] = offload_dir
print(f"Disk offload directory: {offload_dir}")
# Add Flash Attention 2 configuration if enabled and available # Add Flash Attention 2 if enabled
if self.use_flash_attn and self.flash_attn_available: if self.use_flash_attn and self.flash_attn_available:
load_kwargs['attn_implementation'] = "flash_attention_2" load_kwargs['attn_implementation'] = "flash_attention_2"
print("\nUsing Flash Attention 2 for attention implementation") print("Using Flash Attention 2")
print(f"\nModel loading arguments:")
for key, value in load_kwargs.items():
print(f" {key}: {value}")
# Load model # Load model
self.model = AutoModelForCausalLM.from_pretrained( self.model = AutoModelForCausalLM.from_pretrained(model_name, **load_kwargs)
model_name,
**load_kwargs
)
# Handle CPU case where device_map is None
if self.device == "cpu" and load_kwargs.get('device_map') is None: if self.device == "cpu" and load_kwargs.get('device_map') is None:
self.model = self.model.to(self.device) self.model = self.model.to(self.device)
self.model.eval() self.model.eval()
self.model_name = model_name self.model_name = model_name
self.tool_parser = ToolCallParser(self.tokenizer)
# Print model device placement
if hasattr(self.model, 'hf_device_map'):
print(f"\nDevice map:")
for layer, device in self.model.hf_device_map.items():
print(f" {layer}: {device}")
print(f"\nModel loaded successfully") print(f"\nModel loaded successfully")
print(f"Model device: {next(self.model.parameters()).device}") print(f"Model device: {next(self.model.parameters()).device}")
...@@ -632,41 +470,74 @@ class ModelManager: ...@@ -632,41 +470,74 @@ class ModelManager:
formatted.append("Assistant:") formatted.append("Assistant:")
return "\n\n".join(formatted) return "\n\n".join(formatted)
def _validate_generation_params(self, temperature: float, top_p: float) -> tuple: def _validate_params(self, temperature: float, top_p: float) -> tuple:
"""Validate and clamp generation parameters for numerical stability.""" """Validate generation parameters."""
# Clamp temperature to avoid numerical issues
# Temperature must be > 0 for sampling, but very small values can cause issues
if temperature <= 0: if temperature <= 0:
temperature = 1.0 temperature = 1.0
do_sample = False do_sample = False
else: else:
temperature = max(0.01, min(temperature, 2.0)) temperature = max(0.01, min(temperature, 2.0))
do_sample = True do_sample = True
# Clamp top_p
top_p = max(0.0, min(top_p, 1.0)) top_p = max(0.0, min(top_p, 1.0))
return temperature, top_p, do_sample return temperature, top_p, do_sample
def generate_stream( def generate(self, prompt: str, max_tokens: Optional[int] = None,
self, temperature: float = 0.7, top_p: float = 1.0,
prompt: str, stop: Optional[List[str]] = None) -> str:
max_tokens: Optional[int] = None, """Generate text non-streaming."""
temperature: float = 0.7, import torch
top_p: float = 1.0, from transformers import LogitsProcessor, LogitsProcessorList
stop: Optional[List[str]] = None,
) -> AsyncGenerator[str, None]: class InvalidLogitsProcessor(LogitsProcessor):
"""Generate text in streaming fashion.""" def __call__(self, input_ids, scores):
finfo = torch.finfo(scores.dtype)  # torch.isinf also matches -inf, and 1e9 overflows in float16
scores = torch.nan_to_num(scores, nan=finfo.min, posinf=finfo.max, neginf=finfo.min)
return scores
inputs = self.tokenizer(prompt, return_tensors="pt", padding=True) inputs = self.tokenizer(prompt, return_tensors="pt", padding=True)
inputs = {k: v.to(self.model.device) for k, v in inputs.items()} inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
input_length = inputs["input_ids"].shape[1] if max_tokens is None:
max_tokens = 512
temperature, top_p, do_sample = self._validate_params(temperature, top_p)
with torch.no_grad():
outputs = self.model.generate(
input_ids=inputs["input_ids"],
attention_mask=inputs["attention_mask"],
max_new_tokens=max_tokens,
temperature=temperature if do_sample else None,
top_p=top_p if do_sample else None,
do_sample=do_sample,
pad_token_id=self.tokenizer.pad_token_id,
eos_token_id=self.tokenizer.eos_token_id,
logits_processor=LogitsProcessorList([InvalidLogitsProcessor()]),
)
generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
return self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
async def generate_stream(self, prompt: str, max_tokens: Optional[int] = None,
temperature: float = 0.7, top_p: float = 1.0,
stop: Optional[List[str]] = None) -> AsyncGenerator[str, None]:
"""Generate text in streaming fashion."""
import torch
from transformers import TextIteratorStreamer, LogitsProcessor, LogitsProcessorList, StoppingCriteria, StoppingCriteriaList
class InvalidLogitsProcessor(LogitsProcessor):
def __call__(self, input_ids, scores):
finfo = torch.finfo(scores.dtype)  # torch.isinf also matches -inf, and 1e9 overflows in float16
scores = torch.nan_to_num(scores, nan=finfo.min, posinf=finfo.max, neginf=finfo.min)
return scores
inputs = self.tokenizer(prompt, return_tensors="pt", padding=True)
inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
if max_tokens is None: if max_tokens is None:
max_tokens = 512 max_tokens = 512
# Validate parameters temperature, top_p, do_sample = self._validate_params(temperature, top_p)
temperature, top_p, do_sample = self._validate_generation_params(temperature, top_p)
streamer = TextIteratorStreamer( streamer = TextIteratorStreamer(
self.tokenizer, self.tokenizer,
...@@ -684,13 +555,9 @@ class ModelManager: ...@@ -684,13 +555,9 @@ class ModelManager:
"streamer": streamer, "streamer": streamer,
"pad_token_id": self.tokenizer.pad_token_id, "pad_token_id": self.tokenizer.pad_token_id,
"eos_token_id": self.tokenizer.eos_token_id, "eos_token_id": self.tokenizer.eos_token_id,
"logits_processor": LogitsProcessorList([InvalidLogitsProcessor()]),
} }
# Add logits processor to handle NaN/Inf values
generation_kwargs["logits_processor"] = LogitsProcessorList([
InvalidLogitsProcessor()
])
# Handle stop sequences # Handle stop sequences
if stop: if stop:
class StopOnSequence(StoppingCriteria): class StopOnSequence(StoppingCriteria):
...@@ -706,106 +573,279 @@ class ModelManager: ...@@ -706,106 +573,279 @@ class ModelManager:
StopOnSequence(stop, self.tokenizer) StopOnSequence(stop, self.tokenizer)
]) ])
# Run generation in a separate thread with error handling # Run generation in a separate thread
generated_text = "" thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
try: thread.start()
thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
thread.start() for text in streamer:
yield text
for text in streamer:
generated_text += text thread.join()
yield text
def get_model_name(self) -> str:
thread.join() return self.model_name or "unknown"
except RuntimeError as e:
if "probability tensor contains" in str(e): def cleanup(self) -> None:
print(f"Warning: Numerical error during generation: {e}") import torch
print("This may be due to temperature=0 or numerical instability.") if self.model is not None:
print("Trying again with greedy decoding...") del self.model
# Fallback to greedy decoding del self.tokenizer
generation_kwargs["do_sample"] = False self.model = None
generation_kwargs["temperature"] = None self.tokenizer = None
generation_kwargs["top_p"] = None if torch.cuda.is_available():
thread = Thread(target=self.model.generate, kwargs=generation_kwargs) torch.cuda.empty_cache()
thread.start()
for text in streamer:
generated_text += text # =============================================================================
yield text # Vulkan Backend (llama-cpp-python)
thread.join() # =============================================================================
else:
class VulkanBackend(ModelBackend):
"""Backend for Vulkan (AMD GPUs) using llama-cpp-python with GGUF models."""
def __init__(self):
self.model = None
self.model_name = None
self.n_gpu_layers = -1 # Offload all layers to GPU by default
self.n_ctx = 2048
self.verbose = True
def load_model(self, model_name: str, **kwargs) -> None:
"""Load a GGUF model using llama-cpp-python."""
from llama_cpp import Llama
# model_name should be a path to a .gguf file or a HuggingFace model ID
# that will be resolved to a GGUF file
n_gpu_layers = kwargs.get('n_gpu_layers', -1)
n_ctx = kwargs.get('n_ctx', 2048)
verbose = kwargs.get('verbose', True)
# Check if model_name is a local file
if os.path.isfile(model_name):
model_path = model_name
print(f"Loading local GGUF model: {model_path}")
else:
# Try to download from HuggingFace Hub
print(f"Attempting to download GGUF model: {model_name}")
try:
from huggingface_hub import hf_hub_download, list_repo_files
# Parse model name (format: "org/model" or "org/model/filename.gguf")
parts = model_name.split('/')
if len(parts) >= 2:
repo_id = f"{parts[0]}/{parts[1]}"
# If specific file provided
if len(parts) >= 3 and parts[-1].endswith('.gguf'):
filename = '/'.join(parts[2:])
else:
# Find GGUF files in the repo
files = list_repo_files(repo_id)
gguf_files = [f for f in files if f.endswith('.gguf')]
if not gguf_files:
raise ValueError(f"No GGUF files found in {repo_id}")
# Prefer Q4_K_M quantized models for good balance
preferred = [f for f in gguf_files if 'q4_k_m' in f.lower()]
if preferred:
filename = preferred[0]
else:
filename = gguf_files[0]
print(f"Selected GGUF file: {filename}")
model_path = hf_hub_download(repo_id=repo_id, filename=filename)
print(f"Downloaded to: {model_path}")
else:
raise ValueError(f"Invalid model name format: {model_name}")
except Exception as e:
print(f"Error downloading model: {e}")
print("Please provide a local path to a .gguf file")
raise raise
print(f"Loading GGUF model with Vulkan support...")
print(f" Model path: {model_path}")
print(f" GPU layers: {n_gpu_layers} (-1 = all layers)")
print(f" Context size: {n_ctx}")
try:
self.model = Llama(
model_path=model_path,
n_gpu_layers=n_gpu_layers,
n_ctx=n_ctx,
verbose=verbose,
)
self.model_name = model_name
print("\nModel loaded successfully with Vulkan!")
except Exception as e:
print(f"Error loading model with Vulkan: {e}")
print("Make sure Vulkan drivers are installed:")
print(" Debian/Ubuntu: sudo apt install libvulkan-dev vulkan-tools")
print(" Fedora: sudo dnf install vulkan-loader-devel vulkan-tools")
raise
def generate( def format_messages(self, messages: List[ChatMessage]) -> str:
self, """Format messages into a prompt string suitable for chat models."""
prompt: str, formatted = []
max_tokens: Optional[int] = None,
temperature: float = 0.7, for msg in messages:
top_p: float = 1.0, if msg.role == "system":
stop: Optional[List[str]] = None, formatted.append(f"<|system|>\n{msg.content}")
) -> str: elif msg.role == "user":
"""Generate text non-streaming.""" formatted.append(f"<|user|>\n{msg.content}")
inputs = self.tokenizer(prompt, return_tensors="pt", padding=True) elif msg.role == "assistant":
inputs = {k: v.to(self.model.device) for k, v in inputs.items()} content = msg.content or ""
formatted.append(f"<|assistant|>\n{content}")
formatted.append("<|assistant|>\n")
return "\n".join(formatted)
def generate(self, prompt: str, max_tokens: Optional[int] = None,
temperature: float = 0.7, top_p: float = 1.0,
stop: Optional[List[str]] = None) -> str:
"""Generate text non-streaming using llama-cpp."""
if max_tokens is None: if max_tokens is None:
max_tokens = 512 max_tokens = 512
# Validate parameters output = self.model(
temperature, top_p, do_sample = self._validate_generation_params(temperature, top_p) prompt,
max_tokens=max_tokens,
temperature=temperature,
top_p=top_p,
stop=stop or [],
)
try: return output["choices"][0]["text"]
with torch.no_grad():
outputs = self.model.generate(
input_ids=inputs["input_ids"],
attention_mask=inputs["attention_mask"],
max_new_tokens=max_tokens,
temperature=temperature if do_sample else None,
top_p=top_p if do_sample else None,
do_sample=do_sample,
pad_token_id=self.tokenizer.pad_token_id,
eos_token_id=self.tokenizer.eos_token_id,
stopping_criteria=self._create_stopping_criteria(stop) if stop else None,
logits_processor=LogitsProcessorList([InvalidLogitsProcessor()]),
)
generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
return self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
except RuntimeError as e:
if "probability tensor contains" in str(e):
print(f"Warning: Numerical error during generation: {e}")
print("Retrying with greedy decoding...")
# Fallback to greedy decoding
with torch.no_grad():
outputs = self.model.generate(
input_ids=inputs["input_ids"],
attention_mask=inputs["attention_mask"],
max_new_tokens=max_tokens,
do_sample=False,
pad_token_id=self.tokenizer.pad_token_id,
eos_token_id=self.tokenizer.eos_token_id,
stopping_criteria=self._create_stopping_criteria(stop) if stop else None,
logits_processor=LogitsProcessorList([InvalidLogitsProcessor()]),
)
generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
return self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
else:
raise
def _create_stopping_criteria(self, stop_sequences): async def generate_stream(self, prompt: str, max_tokens: Optional[int] = None,
"""Create stopping criteria for stop sequences.""" temperature: float = 0.7, top_p: float = 1.0,
if not stop_sequences: stop: Optional[List[str]] = None) -> AsyncGenerator[str, None]:
return None """Generate text in streaming fashion using llama-cpp."""
if max_tokens is None:
max_tokens = 512
class StopOnSequence(StoppingCriteria): stream = self.model(
def __init__(self, stop_sequences, tokenizer): prompt,
self.stop_sequences = stop_sequences max_tokens=max_tokens,
self.tokenizer = tokenizer temperature=temperature,
top_p=top_p,
def __call__(self, input_ids, scores, **kwargs): stop=stop or [],
decoded = self.tokenizer.decode(input_ids[0][-20:], skip_special_tokens=True) stream=True,
return any(seq in decoded for seq in self.stop_sequences) )
for chunk in stream:
text = chunk["choices"][0].get("text", "")
if text:
yield text
def get_model_name(self) -> str:
return self.model_name or "unknown"
def cleanup(self) -> None:
if self.model is not None:
del self.model
self.model = None
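The model-name handling in `VulkanBackend.load_model` mixes parsing, file selection, and network calls in one block. A minimal sketch of the first two steps as pure functions is shown below, assuming the same `org/model[/path/file.gguf]` convention and Q4_K_M preference as the code above; `split_model_name` and `pick_gguf_file` are hypothetical helper names, and the file list would come from `list_repo_files` in the real code.

```python
from typing import List, Optional, Tuple


def split_model_name(model_name: str) -> Tuple[str, Optional[str]]:
    """Split 'org/model[/path/to/file.gguf]' into (repo_id, filename or None)."""
    parts = model_name.split("/")
    if len(parts) < 2:
        raise ValueError(f"Invalid model name format: {model_name}")
    repo_id = f"{parts[0]}/{parts[1]}"
    # Anything after the repo id is treated as a file path only if it is a GGUF file.
    if len(parts) >= 3 and parts[-1].endswith(".gguf"):
        return repo_id, "/".join(parts[2:])
    return repo_id, None


def pick_gguf_file(files: List[str], preferred_quant: str = "q4_k_m") -> str:
    """Pick a GGUF file, preferring the given quantization tag (case-insensitive)."""
    gguf_files = [f for f in files if f.endswith(".gguf")]
    if not gguf_files:
        raise ValueError("No GGUF files found")
    preferred = [f for f in gguf_files if preferred_quant in f.lower()]
    return preferred[0] if preferred else gguf_files[0]


print(split_model_name("org/model/sub/model.Q4_K_M.gguf"))
print(pick_gguf_file(["README.md", "model.Q8_0.gguf", "model.Q4_K_M.gguf"]))
```

Separating these steps from the download keeps the selection rule testable offline and makes the fallback (first GGUF file in the listing) explicit.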
# =============================================================================
# Model Manager
# =============================================================================
class ModelManager:
"""Manages the loaded model and tokenizer."""
def __init__(self):
self.backend: Optional[ModelBackend] = None
self.backend_type: Optional[str] = None
self.tool_parser = ToolCallParser()
def load_model(self, model_name: str, backend_type: str = "auto", **kwargs):
"""
Load the model with the specified backend.
Args:
model_name: Model name or path
backend_type: 'nvidia', 'vulkan', or 'auto' to detect
**kwargs: Additional arguments for the specific backend
"""
available = detect_available_backends()
# Determine backend
if backend_type == "auto":
if available.get('nvidia'):
backend_type = "nvidia"
print("Auto-detected NVIDIA backend")
elif available.get('vulkan'):
backend_type = "vulkan"
print("Auto-detected Vulkan backend")
else:
print("No GPU backend detected. For NVIDIA, install PyTorch with CUDA.")
print("For Vulkan, install llama-cpp-python with Vulkan support.")
raise RuntimeError("No suitable backend found")
self.backend_type = backend_type
# Create appropriate backend
if backend_type == "nvidia":
if not available.get('nvidia'):
raise RuntimeError("NVIDIA backend requested but PyTorch/CUDA not available")
self.backend = NvidiaBackend()
elif backend_type == "vulkan":
if not available.get('vulkan'):
raise RuntimeError("Vulkan backend requested but llama-cpp-python not available")
self.backend = VulkanBackend()
else:
raise ValueError(f"Unknown backend: {backend_type}")
# Load the model
self.backend.load_model(model_name, **kwargs)
self.tool_parser = ToolCallParser()
return StoppingCriteriaList([StopOnSequence(stop_sequences, self.tokenizer)]) def format_messages(self, messages: List[ChatMessage]) -> str:
"""Format messages into a prompt string."""
if self.backend is None:
raise RuntimeError("No model loaded")
return self.backend.format_messages(messages)
def generate(self, prompt: str, max_tokens: Optional[int] = None,
temperature: float = 0.7, top_p: float = 1.0,
stop: Optional[List[str]] = None) -> str:
"""Generate text non-streaming."""
if self.backend is None:
raise RuntimeError("No model loaded")
return self.backend.generate(prompt, max_tokens, temperature, top_p, stop)
async def generate_stream(self, prompt: str, max_tokens: Optional[int] = None,
temperature: float = 0.7, top_p: float = 1.0,
stop: Optional[List[str]] = None) -> AsyncGenerator[str, None]:
"""Generate text in streaming fashion."""
if self.backend is None:
raise RuntimeError("No model loaded")
async for chunk in self.backend.generate_stream(prompt, max_tokens, temperature, top_p, stop):
yield chunk
@property
def model_name(self) -> str:
if self.backend is None:
return "unknown"
return self.backend.get_model_name()
@property
def model(self):
if self.backend is None:
return None
return self.backend
@property
def tokenizer(self):
# Only NVIDIA backend has a tokenizer
if isinstance(self.backend, NvidiaBackend):
return self.backend.tokenizer
return None
def cleanup(self):
if self.backend is not None:
self.backend.cleanup()
self.backend = None
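The backend choice in `ModelManager.load_model` follows a fixed priority: an explicit request must be available, and `auto` tries NVIDIA before Vulkan. The sketch below restates that rule as a standalone function; `resolve_backend` is a hypothetical name, and the `available` dict is assumed to have the same shape as the result of `detect_available_backends()`.

```python
from typing import Dict


def resolve_backend(requested: str, available: Dict[str, bool]) -> str:
    """Map 'auto', 'nvidia', or 'vulkan' to a concrete, available backend name."""
    if requested == "auto":
        # NVIDIA takes priority when both backends are installed.
        for name in ("nvidia", "vulkan"):
            if available.get(name):
                return name
        raise RuntimeError("No suitable backend found")
    if requested not in ("nvidia", "vulkan"):
        raise ValueError(f"Unknown backend: {requested}")
    if not available.get(requested):
        raise RuntimeError(f"{requested} backend requested but not available")
    return requested


print(resolve_backend("auto", {"nvidia": False, "vulkan": True}))
print(resolve_backend("nvidia", {"nvidia": True, "vulkan": True}))
```

One consequence of this ordering is that a machine with both stacks installed needs `--backend vulkan` to reach the llama-cpp-python path.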
# Global model manager # Global model manager
...@@ -822,16 +862,13 @@ async def lifespan(app: FastAPI): ...@@ -822,16 +862,13 @@ async def lifespan(app: FastAPI):
# Startup # Startup
yield yield
# Shutdown # Shutdown
if model_manager.model is not None: model_manager.cleanup()
del model_manager.model
del model_manager.tokenizer
torch.cuda.empty_cache() if torch.cuda.is_available() else None
app = FastAPI( app = FastAPI(
title="OpenAI-Compatible API", title="OpenAI-Compatible API",
description="OpenAI-compatible API for HuggingFace models with memory-aware loading", description="OpenAI-compatible API supporting NVIDIA (CUDA) and Vulkan backends",
version="1.0.0", version="2.0.0",
lifespan=lifespan, lifespan=lifespan,
) )
...@@ -850,7 +887,7 @@ async def list_models(): ...@@ -850,7 +887,7 @@ async def list_models():
@app.post("/v1/chat/completions") @app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest): async def chat_completions(request: ChatCompletionRequest):
"""Chat completions endpoint with streaming and tool support.""" """Chat completions endpoint with streaming and tool support."""
if model_manager.model is None: if model_manager.backend is None:
raise HTTPException(status_code=503, detail="Model not loaded") raise HTTPException(status_code=503, detail="Model not loaded")
# Format messages with tools if provided # Format messages with tools if provided
...@@ -910,7 +947,7 @@ async def stream_chat_response( ...@@ -910,7 +947,7 @@ async def stream_chat_response(
generated_text = "" generated_text = ""
try: try:
for chunk in model_manager.generate_stream( async for chunk in model_manager.generate_stream(
prompt=prompt, prompt=prompt,
max_tokens=max_tokens, max_tokens=max_tokens,
temperature=temperature, temperature=temperature,
...@@ -936,7 +973,6 @@ async def stream_chat_response( ...@@ -936,7 +973,6 @@ async def stream_chat_response(
if tools: if tools:
tool_calls = model_manager.tool_parser.extract_tool_calls(generated_text, tools) tool_calls = model_manager.tool_parser.extract_tool_calls(generated_text, tools)
if tool_calls: if tool_calls:
# Send tool calls as final delta
data = { data = {
"id": completion_id, "id": completion_id,
"object": "chat.completion.chunk", "object": "chat.completion.chunk",
...@@ -957,7 +993,6 @@ async def stream_chat_response( ...@@ -957,7 +993,6 @@ async def stream_chat_response(
yield "data: [DONE]\n\n" yield "data: [DONE]\n\n"
except Exception as e: except Exception as e:
print(f"Error during streaming generation: {e}") print(f"Error during streaming generation: {e}")
# Send error event
data = { data = {
"id": completion_id, "id": completion_id,
"object": "chat.completion.chunk", "object": "chat.completion.chunk",
...@@ -1010,6 +1045,15 @@ async def generate_chat_response( ...@@ -1010,6 +1045,15 @@ async def generate_chat_response(
response_message["content"] = None response_message["content"] = None
finish_reason = "tool_calls" finish_reason = "tool_calls"
# Calculate token counts if tokenizer available
if model_manager.tokenizer:
prompt_tokens = len(model_manager.tokenizer.encode(prompt))
completion_tokens = len(model_manager.tokenizer.encode(generated_text))
else:
# Rough whitespace-based estimate: the Vulkan backend exposes no tokenizer here
prompt_tokens = len(prompt.split())
completion_tokens = len(generated_text.split())
return { return {
"id": completion_id, "id": completion_id,
"object": "chat.completion", "object": "chat.completion",
...@@ -1021,9 +1065,9 @@ async def generate_chat_response( ...@@ -1021,9 +1065,9 @@ async def generate_chat_response(
"finish_reason": finish_reason, "finish_reason": finish_reason,
}], }],
"usage": { "usage": {
"prompt_tokens": len(model_manager.tokenizer.encode(prompt)), "prompt_tokens": prompt_tokens,
"completion_tokens": len(model_manager.tokenizer.encode(generated_text)), "completion_tokens": completion_tokens,
"total_tokens": len(model_manager.tokenizer.encode(prompt)) + len(model_manager.tokenizer.encode(generated_text)), "total_tokens": prompt_tokens + completion_tokens,
}, },
} }
except Exception as e: except Exception as e:
...@@ -1034,7 +1078,7 @@ async def generate_chat_response( ...@@ -1034,7 +1078,7 @@ async def generate_chat_response(
@app.post("/v1/completions") @app.post("/v1/completions")
async def completions(request: CompletionRequest): async def completions(request: CompletionRequest):
"""Text completions endpoint.""" """Text completions endpoint."""
if model_manager.model is None: if model_manager.backend is None:
raise HTTPException(status_code=503, detail="Model not loaded") raise HTTPException(status_code=503, detail="Model not loaded")
prompts = request.prompt if isinstance(request.prompt, list) else [request.prompt] prompts = request.prompt if isinstance(request.prompt, list) else [request.prompt]
...@@ -1078,7 +1122,7 @@ async def stream_completion_response( ...@@ -1078,7 +1122,7 @@ async def stream_completion_response(
created = int(time.time()) created = int(time.time())
try: try:
for chunk in model_manager.generate_stream( async for chunk in model_manager.generate_stream(
prompt=prompt, prompt=prompt,
max_tokens=max_tokens, max_tokens=max_tokens,
temperature=temperature, temperature=temperature,
...@@ -1128,6 +1172,14 @@ async def generate_completion_response( ...@@ -1128,6 +1172,14 @@ async def generate_completion_response(
stop=stop, stop=stop,
) )
# Calculate token counts if tokenizer available
if model_manager.tokenizer:
prompt_tokens = len(model_manager.tokenizer.encode(prompt))
completion_tokens = len(model_manager.tokenizer.encode(generated_text))
else:
prompt_tokens = len(prompt.split())
completion_tokens = len(generated_text.split())
return { return {
"id": completion_id, "id": completion_id,
"object": "text_completion", "object": "text_completion",
...@@ -1140,9 +1192,9 @@ async def generate_completion_response( ...@@ -1140,9 +1192,9 @@ async def generate_completion_response(
"finish_reason": "stop", "finish_reason": "stop",
}], }],
"usage": { "usage": {
"prompt_tokens": len(model_manager.tokenizer.encode(prompt)), "prompt_tokens": prompt_tokens,
"completion_tokens": len(model_manager.tokenizer.encode(generated_text)), "completion_tokens": completion_tokens,
"total_tokens": len(model_manager.tokenizer.encode(prompt)) + len(model_manager.tokenizer.encode(generated_text)), "total_tokens": prompt_tokens + completion_tokens,
}, },
} }
except Exception as e: except Exception as e:
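Note on the token-count fallback in `generate_chat_response` and `generate_completion_response`: splitting on whitespace counts words, and subword tokenizers typically produce more tokens than words, so the reported `usage` is likely to be low on the Vulkan path. The following is a minimal sketch of one way to centralize the fallback, not the code in this commit. The helper names `count_tokens` and `build_usage` and the 4-characters-per-token ratio are assumptions made for illustration; the ratio is only a rough rule of thumb for English text.

```python
import math
from typing import Any, Dict, Optional


def count_tokens(text: str, tokenizer: Optional[Any] = None,
                 chars_per_token: float = 4.0) -> int:
    """Count tokens with a real tokenizer if one exists, else estimate.

    `tokenizer` is any object with an `encode(text)` method that returns a
    sequence (hypothetical interface, mirroring how the diff calls
    `model_manager.tokenizer.encode`).
    """
    if tokenizer is not None:
        return len(tokenizer.encode(text))
    if not text:
        return 0
    # Heuristic only: assumes roughly `chars_per_token` characters per token.
    return max(1, math.ceil(len(text) / chars_per_token))


def build_usage(prompt: str, generated_text: str,
                tokenizer: Optional[Any] = None) -> Dict[str, int]:
    """Build the OpenAI-style `usage` dict from one place."""
    prompt_tokens = count_tokens(prompt, tokenizer)
    completion_tokens = count_tokens(generated_text, tokenizer)
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
```

Both response builders could then call `build_usage(prompt, generated_text, model_manager.tokenizer)` instead of repeating the if/else block.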
...@@ -1157,13 +1209,20 @@ async def generate_completion_response( ...@@ -1157,13 +1209,20 @@ async def generate_completion_response(
def parse_args(): def parse_args():
"""Parse command line arguments.""" """Parse command line arguments."""
parser = argparse.ArgumentParser( parser = argparse.ArgumentParser(
description="OpenAI-compatible API server with memory-aware model loading" description="OpenAI-compatible API server supporting NVIDIA (CUDA) and Vulkan backends"
) )
parser.add_argument( parser.add_argument(
"--model", "--model",
type=str, type=str,
default=None, default=None,
help="HuggingFace model name or path", help="Model name or path. For NVIDIA: HuggingFace model. For Vulkan: GGUF file path or HF repo",
)
parser.add_argument(
"--backend",
type=str,
choices=["auto", "nvidia", "vulkan"],
default="auto",
help="Backend to use: auto (detect), nvidia (CUDA), or vulkan (AMD GPUs)",
) )
parser.add_argument( parser.add_argument(
"--host", "--host",
...@@ -1181,68 +1240,116 @@ def parse_args(): ...@@ -1181,68 +1240,116 @@ def parse_args():
"--offload-dir", "--offload-dir",
type=str, type=str,
default="./offload", default="./offload",
help="Directory for disk offload when model doesn't fit in VRAM+RAM (default: ./offload)", help="Directory for disk offload (NVIDIA backend only, default: ./offload)",
) )
parser.add_argument( parser.add_argument(
"--load-in-4bit", "--load-in-4bit",
action="store_true", action="store_true",
help="Load model in 4-bit precision (requires bitsandbytes)", help="Load model in 4-bit precision (NVIDIA backend only, requires bitsandbytes)",
) )
parser.add_argument( parser.add_argument(
"--load-in-8bit", "--load-in-8bit",
action="store_true", action="store_true",
help="Load model in 8-bit precision (requires bitsandbytes)", help="Load model in 8-bit precision (NVIDIA backend only, requires bitsandbytes)",
) )
parser.add_argument( parser.add_argument(
"--ram", "--ram",
type=float, type=float,
default=None, default=None,
help="Manually specify available RAM in GB (bypasses auto-detection)", help="Manually specify available RAM in GB (NVIDIA backend only)",
) )
parser.add_argument( parser.add_argument(
"--flash-attn", "--flash-attn",
action="store_true", action="store_true",
help="Use Flash Attention 2 for faster inference (requires flash-attn package and compatible GPU)", help="Use Flash Attention 2 (NVIDIA backend only, requires flash-attn package)",
)
parser.add_argument(
"--n-gpu-layers",
type=int,
default=-1,
help="Number of layers to offload to GPU (Vulkan backend only, default: -1 = all layers)",
)
parser.add_argument(
"--n-ctx",
type=int,
default=2048,
help="Context window size (Vulkan backend only, default: 2048)",
) )
return parser.parse_args() return parser.parse_args()
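Note on the backend-specific options: the help strings mark several flags as "NVIDIA backend only" or "Vulkan backend only", but nothing shown in this hunk tells the user when such a flag is passed to the other backend. A minimal sketch of a check that could run after parsing is below. The function name `ignored_flags` is hypothetical; the defaults are copied from the `add_argument` calls above.

```python
import argparse
from typing import List

# Defaults taken from the add_argument calls in parse_args().
NVIDIA_ONLY_DEFAULTS = {
    "offload_dir": "./offload",
    "load_in_4bit": False,
    "load_in_8bit": False,
    "ram": None,
    "flash_attn": False,
}
VULKAN_ONLY_DEFAULTS = {
    "n_gpu_layers": -1,
    "n_ctx": 2048,
}


def ignored_flags(args: argparse.Namespace) -> List[str]:
    """Return flags set to a non-default value that the chosen backend ignores.

    With --backend auto the backend is not known at parse time, so nothing
    is reported.
    """
    if args.backend == "vulkan":
        foreign = NVIDIA_ONLY_DEFAULTS
    elif args.backend == "nvidia":
        foreign = VULKAN_ONLY_DEFAULTS
    else:
        return []
    return sorted(
        "--" + name.replace("_", "-")
        for name, default in foreign.items()
        if getattr(args, name, default) != default
    )
```

A caller could print one warning line per returned flag before loading the model, for example `print(f"Warning: {flag} is ignored by the {args.backend} backend")`.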
def main(): def main():
"""Main entry point.""" """Main entry point."""
import procname # Optional: set process name if procname is available
procname.setprocname("coderai") try:
import procname
procname.setprocname("coderai")
except ImportError:
pass
args = parse_args() args = parse_args()
# Get model name from args or prompt interactively # Get model name from args or prompt interactively
model_name = args.model model_name = args.model
if model_name is None: if model_name is None:
print("No model specified. Please enter a HuggingFace model name.") print("No model specified. Please enter a model name.")
print("Examples:") print("")
print("For NVIDIA backend (HuggingFace models):")
print(" - microsoft/DialoGPT-medium") print(" - microsoft/DialoGPT-medium")
print(" - facebook/blenderbot-400M-distill")
print(" - meta-llama/Llama-2-7b-chat-hf (requires auth)") print(" - meta-llama/Llama-2-7b-chat-hf (requires auth)")
print(" - TinyLlama/TinyLlama-1.1B-Chat-v1.0") print(" - TinyLlama/TinyLlama-1.1B-Chat-v1.0")
print("") print("")
print("For Vulkan backend (GGUF models):")
print(" - Local path: ./phi-3-mini-4k-instruct-q4_k_m.gguf")
print(" - HuggingFace: microsoft/Phi-3-mini-4k-instruct-gguf")
print("")
model_name = input("Enter model name: ").strip() model_name = input("Enter model name: ").strip()
if not model_name: if not model_name:
print("Error: Model name is required") print("Error: Model name is required")
sys.exit(1) sys.exit(1)
# Load the model with memory-aware offload # Detect available backends
model_manager.load_model( available = detect_available_backends()
model_name=model_name, print("\nAvailable backends:")
offload_dir=args.offload_dir, for name, available_flag in available.items():
load_in_4bit=args.load_in_4bit, status = "✓" if available_flag else "✗"
load_in_8bit=args.load_in_8bit, print(f" [{status}] {name}")
manual_ram_gb=args.ram, print("")
flash_attn=getattr(args, 'flash_attn', False),
) # Load the model
load_kwargs = {
'offload_dir': args.offload_dir,
'load_in_4bit': args.load_in_4bit,
'load_in_8bit': args.load_in_8bit,
'manual_ram_gb': args.ram,
'flash_attn': args.flash_attn,
'n_gpu_layers': args.n_gpu_layers,
'n_ctx': args.n_ctx,
}
try:
model_manager.load_model(
model_name=model_name,
backend_type=args.backend,
**load_kwargs
)
except Exception as e:
print(f"\nError loading model: {e}")
print("\nTroubleshooting:")
if args.backend == "vulkan":
print(" - For Vulkan, ensure you have Vulkan drivers installed")
print(" - Make sure you're using a GGUF format model")
print(" - Run build.sh with 'vulkan' argument first")
else:
print(" - For NVIDIA, ensure PyTorch with CUDA is installed")
print(" - Run build.sh with 'nvidia' argument first")
sys.exit(1)
# Start the server # Start the server
import uvicorn import uvicorn
print(f"\nStarting server on http://{args.host}:{args.port}") print(f"\nStarting server on http://{args.host}:{args.port}")
print(f"API documentation available at http://{args.host}:{args.port}/docs") print(f"API documentation available at http://{args.host}:{args.port}/docs")
print(f"Using backend: {model_manager.backend_type}")
uvicorn.run(app, host=args.host, port=args.port) uvicorn.run(app, host=args.host, port=args.port)
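Note on backend detection: `detect_available_backends()` is called in `main()` but its body is not part of this hunk. As an illustration only, one way such a probe could work is to check whether each backend's Python package is importable without actually importing it. The names `module_available` and `probe_backends` are hypothetical and are not the commit's implementation. The sketch assumes the NVIDIA backend needs `torch` and the Vulkan backend needs `llama_cpp` (the import name of llama-cpp-python).

```python
import importlib.util
from typing import Dict


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec can raise for malformed or partially initialized modules.
        return False


def probe_backends() -> Dict[str, bool]:
    """Report which backend packages are installed.

    This checks installation only. It does not verify that a GPU, a CUDA
    runtime, or a Vulkan driver is actually present.
    """
    return {
        "nvidia": module_available("torch"),
        "vulkan": module_available("llama_cpp"),
    }


if __name__ == "__main__":
    # Same status formatting as the loop in main().
    for name, ok in probe_backends().items():
        print(f"  [{'✓' if ok else '✗'}] {name}")
```

Because the probe only looks for packages, a successful result can still be followed by a load failure, which is the case the troubleshooting messages in `main()` cover.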
......
# FastAPI and server dependencies
fastapi>=0.104.0
uvicorn[standard]>=0.24.0
pydantic>=2.5.0
# ML dependencies (transformers-based for NVIDIA/CUDA)
transformers>=4.35.0
accelerate>=0.24.0
# System resource detection
psutil>=5.9.0
procname>=0.3.0 # optional - for setting process name
# Optional: for better performance with NVIDIA GPUs
bitsandbytes>=0.41.0
sentencepiece>=0.1.99
protobuf>=3.20.0
# Optional: Flash Attention 2 for faster inference on supported NVIDIA GPUs
# Requires specific CUDA versions and may need manual installation
# Install with: pip install flash-attn --no-build-isolation
# flash-attn>=2.5.0
# FastAPI and server dependencies
fastapi>=0.104.0
uvicorn[standard]>=0.24.0
pydantic>=2.5.0
# llama-cpp-python is installed by build.sh with Vulkan support
# CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python --no-cache-dir
# System resource detection
psutil>=5.9.0
procname>=0.3.0 # optional - for setting process name
# HuggingFace Hub for downloading GGUF models
huggingface-hub>=0.19.0
# No PyTorch needed for Vulkan backend - llama-cpp handles everything