Add Vulkan support for AMD GPUs alongside NVIDIA/CUDA

- Add build.sh script with nvidia/vulkan arguments (default: nvidia) - Create backend abstraction: ModelBackend base class - Implement NvidiaBackend using HuggingFace Transformers - Implement VulkanBackend using llama-cpp-python with GGUF models - Add separate requirements files for nvidia and vulkan backends - Add --backend argument with auto/nvidia/vulkan options - Add Vulkan-specific options: --n-gpu-layers, --n-ctx - Make procname import optional - Update README with comprehensive Vulkan usage instructions - Add Vulkan troubleshooting section - Add GGUF model recommendations The application now supports: - NVIDIA GPUs via PyTorch/Transformers (HuggingFace models) - AMD GPUs via llama-cpp-python/Vulkan (GGUF models)

Add Vulkan support for AMD GPUs alongside NVIDIA/CUDA
- Add build.sh script with nvidia/vulkan arguments (default: nvidia) - Create backend abstraction: ModelBackend base class - Implement NvidiaBackend using HuggingFace Transformers - Implement VulkanBackend using llama-cpp-python with GGUF models - Add separate requirements files for nvidia and vulkan backends - Add --backend argument with auto/nvidia/vulkan options - Add Vulkan-specific options: --n-gpu-layers, --n-ctx - Make procname import optional - Update README with comprehensive Vulkan usage instructions - Add Vulkan troubleshooting section - Add GGUF model recommendations The application now supports: - NVIDIA GPUs via PyTorch/Transformers (HuggingFace models) - AMD GPUs via llama-cpp-python/Vulkan (GGUF models)
02fb99fa · Stefy Lanza (nextime / spora ) · ae1d0e38 · 02fb99fa · 02fb99fa · 02fb99fa
Commit 02fb99fa authored Feb 28, 2026 by Stefy Lanza (nextime / spora )
Showing with 1029 additions and 536 deletions

README.md README.md +297 -78

build.sh build.sh +129 -0

coderai coderai +565 -458

requirements-nvidia.txt requirements-nvidia.txt +22 -0

requirements-vulkan.txt requirements-vulkan.txt +16 -0

No files found.
--- a/README.md
+++ b/README.md
 # CoderAI
-An OpenAI-compatible API server for HuggingFace models with intelligent memory management, GPU auto-detection, and advanced features like tool calling and streaming.
+An OpenAI-compatible API server supporting both NVIDIA (CUDA) and AMD (Vulkan) GPUs. Uses HuggingFace Transformers for NVIDIA GPUs and llama-cpp-python with Vulkan for AMD GPUs.
 ## Features
+- **Dual Backend Support**: NVIDIA (CUDA) via PyTorch + Transformers, AMD (Vulkan) via llama-cpp-python
 - **OpenAI-Compatible API**: Drop-in replacement for OpenAI's API endpoints
- **Memory-Aware Model Loading**: Automatically determines optimal loading strategy based on available VRAM and RAM
+- **Memory-Aware Model Loading**: Automatically determines optimal loading strategy based on available VRAM and RAM (NVIDIA)
- **Sequential Offloading**: Smart offload from VRAM → RAM → Disk when needed
+- **Sequential Offloading**: Smart offload from VRAM → RAM → Disk when needed (NVIDIA)
- **Multi-GPU Support**: Automatic distribution across multiple CUDA/ROCm devices
+- **Multi-GPU Support**: Automatic distribution across multiple CUDA devices (NVIDIA)
- **GPU Auto-Detection**: Automatically detects CUDA (NVIDIA) or ROCm (AMD) GPUs
+- **GPU Auto-Detection**: Automatically detects available backends
- **Quantization Support**: 4-bit and 8-bit quantization via bitsandbytes for reduced memory usage
+- **Quantization Support**: 4-bit and 8-bit quantization via bitsandbytes (NVIDIA) or built-in GGUF quantization (Vulkan)
- **Flash Attention 2**: Optional faster attention implementation for supported GPUs
+- **Flash Attention 2**: Optional faster attention implementation for supported NVIDIA GPUs
 - **Streaming Responses**: Server-sent events for real-time token generation
 - **Tool Calling**: Support for function calling and tool use
 - **Multiple Endpoints**: `/v1/chat/completions`, `/v1/completions`, and `/v1/models`
@@ -21,68 +22,81 @@ An OpenAI-compatible API server for HuggingFace models with intelligent memory m
 - Python 3.8+
 - For NVIDIA GPUs: CUDA toolkit (11.8+ recommended)
- For AMD GPUs: ROCm (5.6+ recommended, 6.0+ preferred)
+- For AMD GPUs (Vulkan): Vulkan drivers and SDK
 - For CPU-only: No additional requirements
-### Basic Installation
+### Quick Install with Build Script
+The easiest way to install is using the provided build script:
 ```bash
 # Clone the repository
 git clone git@git.nexlab.net:nexlab/coderai.git
 cd coderai
-# Create virtual environment (recommended)
+# For NVIDIA GPUs (default)
-python -m venv venv
+./build.sh nvidia
-source venv/bin/activate  # On Windows: venv\Scripts\activate
-# Install base requirements
+# For AMD GPUs with Vulkan support
-pip install -r requirements.txt
+./build.sh vulkan
 ```
-### Platform-Specific PyTorch Installation
+The build script will:
+- Create a virtual environment
+- Install the appropriate dependencies for your GPU
+- Set up the correct backend
-PyTorch installation varies by platform. Uncomment the appropriate section in [`requirements.txt`](requirements.txt) or install manually:
+### Manual Installation
-> **⚠️ WARNING: Shell Redirection Issue**
+If you prefer manual installation:
-> When using `>=` in pip commands, always use **quotes** around the package specifier!
-> Without quotes, the shell interprets `>` as output redirection.
->
-> ❌ Wrong: `pip install torch>=2.0.0`  (creates file named "=2.0.0")
-> ✅ Correct: `pip install "torch>=2.0.0"` (with quotes)
-> ✅ Also correct: `pip install torch==2.0.0` (exact version, no >=)
-#### NVIDIA (CUDA)
 ```bash
-# For CUDA 11.8
+# Create virtual environment
-pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/cu118
+python -m venv venv
+source venv/bin/activate
-# For CUDA 12.1
+# For NVIDIA GPUs
-pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/cu121
+pip install torch torchvision torchaudio
+pip install -r requirements-nvidia.txt
-# For CUDA 12.4 (latest)
+# For AMD GPUs with Vulkan
-pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0"
+CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python --no-cache-dir
+pip install -r requirements-vulkan.txt
 ```
-#### AMD (ROCm)
+### Platform-Specific Requirements
-```bash
+#### NVIDIA (CUDA)
-# For ROCm 6.0 (recommended for newer AMD GPUs)
-pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/rocm6.0
-# For ROCm 5.6 (for older AMD GPUs)
+Requires:
-pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/rocm5.6
+- NVIDIA GPU with CUDA support
-```
+- CUDA toolkit (11.8+ or 12.1+)
+- PyTorch with CUDA
+Models: HuggingFace format (safetensors/pytorch)
-> **Note**: ROCm 5.4.2 is deprecated. Use ROCm 5.6 or 6.0 for better compatibility.
+#### AMD (Vulkan)
-> Check available versions at: https://pytorch.org/get-started/locally/
-#### CPU Only
+Requires:
+- AMD GPU with Vulkan support (RX 400 series and newer)
+- Vulkan drivers and SDK
+**Install Vulkan drivers:**
 ```bash
-pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/cpu
+# Debian/Ubuntu
+sudo apt install libvulkan-dev vulkan-tools mesa-vulkan-drivers
+# Fedora
+sudo dnf install vulkan-loader-devel vulkan-tools mesa-vulkan-drivers
+# Arch Linux
+sudo pacman -S vulkan-headers vulkan-icd-loader vulkan-radeon
 ```
+Models: GGUF format (from HuggingFace or local files)
+**Note**: The Vulkan backend uses llama-cpp-python with GGUF models, which provides excellent performance on AMD GPUs without requiring ROCm.
 ### Optional Dependencies
 #### bitsandbytes (Quantization)
@@ -116,8 +130,14 @@ pip install flash-attn --no-build-isolation
 ### Basic Usage
 ```bash
-# Run with a specific model
+# Activate the virtual environment created by build.sh
-python coderai --model microsoft/DialoGPT-medium
+source venv/bin/activate
+# Run with NVIDIA backend (HuggingFace models)
+python coderai --model microsoft/DialoGPT-medium --backend nvidia
+# Run with Vulkan backend (GGUF models)
+python coderai --model ./phi-3-mini-4k-instruct-q4_k_m.gguf --backend vulkan
 # The server will start on http://0.0.0.0:8000 by default
 ```
@@ -125,28 +145,68 @@ python coderai --model microsoft/DialoGPT-medium
 ### Command-Line Options
 ```
-usage: coderai [-h] [--model MODEL] [--host HOST] [--port PORT]
+usage: coderai [-h] [--model MODEL] [--backend {auto,nvidia,vulkan}] [--host HOST]
-               [--offload-dir OFFLOAD_DIR] [--load-in-4bit] [--load-in-8bit]
+               [--port PORT] [--offload-dir OFFLOAD_DIR] [--load-in-4bit]
-               [--ram RAM] [--flash-attn]
+               [--load-in-8bit] [--ram RAM] [--flash-attn] [--n-gpu-layers N]
+               [--n-ctx N]
-OpenAI-compatible API server with memory-aware model loading
+OpenAI-compatible API server supporting NVIDIA (CUDA) and Vulkan backends
 options:
  -h, --help            show this help message and exit
-  --model MODEL         HuggingFace model name or path
+  --model MODEL         Model name or path. For NVIDIA: HuggingFace model.
+                        For Vulkan: GGUF file path or HF repo
+  --backend {auto,nvidia,vulkan}
+                        Backend to use: auto (detect), nvidia (CUDA), or
+                        vulkan (AMD GPUs)
  --host HOST           Host to bind to (default: 0.0.0.0)
  --port PORT           Port to bind to (default: 8000)
  --offload-dir OFFLOAD_DIR
-                        Directory for disk offload when model doesn't fit in
+                        Directory for disk offload (NVIDIA only, default: ./offload)
-                        VRAM+RAM (default: ./offload)
+  --load-in-4bit        Load model in 4-bit precision (NVIDIA only, requires bitsandbytes)
-  --load-in-4bit        Load model in 4-bit precision (requires bitsandbytes)
+  --load-in-8bit        Load model in 8-bit precision (NVIDIA only, requires bitsandbytes)
-  --load-in-8bit        Load model in 8-bit precision (requires bitsandbytes)
+  --ram RAM             Manually specify available RAM in GB (NVIDIA only)
-  --ram RAM             Manually specify available RAM in GB (bypasses auto-
+  --flash-attn          Use Flash Attention 2 (NVIDIA only, requires flash-attn)
-                        detection)
+  --n-gpu-layers N      Number of layers to offload to GPU (Vulkan only,
-  --flash-attn          Use Flash Attention 2 for faster inference (requires
+                        default: -1 = all layers)
-                        flash-attn package and compatible GPU)
+  --n-ctx N             Context window size (Vulkan only, default: 2048)
+```
+### Backend Selection
+The `--backend` option controls which backend to use:
+- **`auto`** (default): Automatically detects available backends, preferring NVIDIA if available
+- **`nvidia`**: Use PyTorch + Transformers with CUDA (for NVIDIA GPUs)
+- **`vulkan`**: Use llama-cpp-python with Vulkan (for AMD GPUs)
+### Model Formats by Backend
+#### NVIDIA Backend
+Uses HuggingFace Transformers format:
+```bash
+python coderai --model microsoft/DialoGPT-medium --backend nvidia
+python coderai --model meta-llama/Llama-2-7b-chat-hf --backend nvidia
 ```
+#### Vulkan Backend
+Uses GGUF format (can be local files or downloaded from HuggingFace):
+```bash
+# Local GGUF file
+python coderai --model ./phi-3-mini-4k-instruct-q4_k_m.gguf --backend vulkan
+# Download from HuggingFace (auto-selects GGUF file)
+python coderai --model microsoft/Phi-3-mini-4k-instruct-gguf --backend vulkan
+# Specific GGUF file from repo
+python coderai --model TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_K_M.gguf --backend vulkan
+```
+**Finding GGUF models:**
+- Search on HuggingFace: https://huggingface.co/models?search=gguf
+- Popular collections: TheBloke, unsloth, bartowski
+- Recommended quantization: Q4_K_M for best speed/quality balance
 ### Examples
 #### Run with 4-bit Quantization (Low VRAM)
@@ -276,41 +336,72 @@ curl -X POST http://localhost:8000/v1/chat/completions \
 ## Configuration for Different Setups
-### CUDA (NVIDIA GPU)
+### NVIDIA (CUDA)
 ```bash
-# Install CUDA-enabled PyTorch
+# Using build script
-pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/cu121
+./build.sh nvidia
+# Or manually install CUDA-enabled PyTorch
+pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0"
+pip install -r requirements-nvidia.txt
-# Run with GPU acceleration (automatic)
+# Run with GPU acceleration
-python coderai --model meta-llama/Llama-2-7b-chat-hf
+python coderai --model meta-llama/Llama-2-7b-chat-hf --backend nvidia
 # Optional: Enable Flash Attention 2 for faster inference
-python coderai --model meta-llama/Llama-2-7b-chat-hf --flash-attn
+python coderai --model meta-llama/Llama-2-7b-chat-hf --backend nvidia --flash-attn
 ```
-### ROCm (AMD GPU)
+### AMD (Vulkan)
 ```bash
-# Install ROCm-enabled PyTorch (use 6.0 for newer GPUs, 5.6 for older)
+# Install Vulkan drivers first
-pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/rocm6.0
+# Debian/Ubuntu:
+sudo apt install libvulkan-dev vulkan-tools mesa-vulkan-drivers
+# Using build script
+./build.sh vulkan
+# Run with GGUF model
+python coderai --model ./phi-3-mini-4k-instruct-q4_k_m.gguf --backend vulkan
-# Run with GPU acceleration (automatic)
+# Or download automatically from HuggingFace
-python coderai --model meta-llama/Llama-2-7b-chat-hf
+python coderai --model TheBloke/Llama-2-7B-GGUF --backend vulkan
-# Check ROCm detection in output
+# Control GPU layer offloading (default: -1 = all layers)
+python coderai --model model.gguf --backend vulkan --n-gpu-layers 35
+# Adjust context window (default: 2048)
+python coderai --model model.gguf --backend vulkan --n-ctx 4096
 ```
+**Vulkan Backend Notes:**
+- Uses GGUF format models (much smaller than full HuggingFace models)
+- Q4_K_M quantization recommended for 4GB+ VRAM GPUs
+- Q5_K_M or Q6_K for higher quality
+- Works on AMD RX 400 series and newer
+- Also works on NVIDIA GPUs but CUDA backend is preferred for NVIDIA
 ### CPU-Only
-```bash
+While not recommended for performance, you can run on CPU:
-# Install CPU-only PyTorch
-pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/cpu
-# Run on CPU (automatic fallback)
+```bash
-python coderai --model microsoft/DialoGPT-medium
+# NVIDIA backend on CPU
+pip install "torch>=2.0.0" --index-url https://download.pytorch.org/whl/cpu
+pip install -r requirements-nvidia.txt
+python coderai --model microsoft/DialoGPT-medium --backend nvidia
+# Or Vulkan backend on CPU (llama-cpp supports CPU fallback)
+CMAKE_ARGS="-DGGML_VULKAN=OFF" pip install llama-cpp-python
+python coderai --model model.gguf --backend vulkan
 ```
+### ROCm Alternative (deprecated)
+While the Vulkan backend is now recommended for AMD GPUs, ROCm support is still available through the NVIDIA backend if you have ROCm-enabled PyTorch installed.
 ### Low VRAM Configuration
 For GPUs with limited VRAM (4-8GB):
@@ -340,24 +431,59 @@ python coderai --model meta-llama/Llama-2-70b-chat-hf --load-in-8bit
 ## Model Recommendations
-### Small Models (For Testing)
+### NVIDIA Backend (HuggingFace Models)
+#### Small Models (For Testing)
 - `microsoft/DialoGPT-medium` (~345M parameters)
 - `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (~1.1B parameters)
 - `facebook/blenderbot-400M-distill` (~400M parameters)
-### Medium Models (4-8GB VRAM with 4-bit)
+#### Medium Models (4-8GB VRAM with 4-bit)
 - `meta-llama/Llama-2-7b-chat-hf` (~7B parameters)
 - `mistralai/Mistral-7B-Instruct-v0.2` (~7B parameters)
 - `HuggingFaceH4/zephyr-7b-beta` (~7B parameters)
-### Large Models (Multiple GPUs or High VRAM)
+#### Large Models (Multiple GPUs or High VRAM)
 - `meta-llama/Llama-2-13b-chat-hf` (~13B parameters)
 - `meta-llama/Llama-2-70b-chat-hf` (~70B parameters) - requires multiple GPUs or disk offload
 - `bigscience/bloom-7b1` (~7B parameters)
+### Vulkan Backend (GGUF Models)
+#### Small Models (2-4GB VRAM)
+- `TheBloke/phi-2-GGUF` - phi-2.Q4_K_M.gguf (~1.6B parameters, ~1GB file)
+- `TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF` - tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
+#### Medium Models (4-8GB VRAM)
+- `TheBloke/Llama-2-7B-GGUF` - llama-2-7b.Q4_K_M.gguf (~4GB file)
+- `TheBloke/Mistral-7B-Instruct-v0.2-GGUF` - mistral-7b-instruct-v0.2.Q4_K_M.gguf
+- `microsoft/Phi-3-mini-4k-instruct-gguf` - Phi-3-mini-4k-instruct-q4.gguf
+#### Large Models (8GB+ VRAM)
+- `TheBloke/Llama-2-13B-GGUF` - llama-2-13b.Q4_K_M.gguf (~7.5GB file)
+- `TheBloke/deepseek-coder-6.7B-base-GGUF` - deepseek-coder-6.7b-base.Q4_K_M.gguf
+**GGUF Quantization Guide:**
+- `Q4_K_M` - Best balance of speed/quality (recommended)
+- `Q5_K_M` - Higher quality, slightly slower
+- `Q6_K` - Near-unquantized quality
+- `Q8_0` - Maximum quality, largest size
+**Download Example:**
+```bash
+# Using huggingface-cli
+huggingface-cli download TheBloke/Llama-2-7B-GGUF llama-2-7b.Q4_K_M.gguf --local-dir ./models
+# Or let coderai download automatically
+python coderai --model TheBloke/Llama-2-7B-GGUF --backend vulkan
+```
 ## Troubleshooting
 ### Shell Redirection Error: "No such file or directory: '0.0'"
@@ -473,6 +599,94 @@ python coderai --model meta-llama/Llama-2-70b-chat-hf --load-in-8bit
 2. Check Python version: `python --version` (should be 3.8+)
 3. Verify virtual environment is activated
+### Vulkan-Specific Issues
+**Problem**: "Vulkan backend not available" or llama-cpp fails to load
+**Solutions**:
+1. **Verify Vulkan drivers are installed:**
+   ```bash
+   # Check Vulkan installation
+   vulkaninfo | grep "deviceName"
+   # Or install if missing
+   # Debian/Ubuntu:
+   sudo apt install libvulkan-dev vulkan-tools mesa-vulkan-drivers
+   # Fedora:
+   sudo dnf install vulkan-loader-devel vulkan-tools mesa-vulkan-drivers
+   ```
+2. **Reinstall llama-cpp-python with Vulkan:**
+   ```bash
+   pip uninstall llama-cpp-python -y
+   CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python --no-cache-dir
+   ```
+3. **Check GPU compatibility:**
+   - AMD RX 400 series and newer
+   - NVIDIA GTX 900 series and newer (but CUDA backend preferred for NVIDIA)
+   - Intel Arc GPUs (experimental)
+**Problem**: GGUF model fails to load or produces garbled output
+**Solutions**:
+1. **Verify model format**: Must be GGUF format, not regular HuggingFace format
+   ```bash
+   # Check file extension
+   ls -la model.gguf  # Should end in .gguf
+   ```
+2. **Try different quantization**: Some GGUF files may be incompatible
+   - Q4_K_M is most compatible (recommended)
+   - Q5_K_M or Q6_K for higher quality
+   - Avoid IQ quants if having issues
+3. **Check model architecture**: Some very new models may need updated llama-cpp
+   ```bash
+   pip install --upgrade llama-cpp-python
+   ```
+**Problem**: Vulkan backend runs on CPU instead of GPU
+**Solutions**:
+1. **Check layer offloading**: Verify layers are being offloaded
+   ```bash
+   # Check GPU layers parameter (default -1 = all layers)
+   python coderai --model model.gguf --backend vulkan --n-gpu-layers 35
+   ```
+2. **Check verbose output**: Look for Vulkan device initialization in logs
+   ```bash
+   # Run with verbose logging
+   python coderai --model model.gguf --backend vulkan 2>&1 | grep -i vulkan
+   ```
+3. **Verify GPU visibility**: Check that Vulkan sees your GPU
+   ```bash
+   vulkaninfo | grep -A 5 "GPU0\|GPU1"
+   ```
+### Backend Not Detected
+**Problem**: "No suitable backend found" error
+**Solutions**:
+1. **Check which backends are available:**
+   ```bash
+   python -c "import coderai; print(coderai.detect_available_backends())"
+   ```
+2. **For NVIDIA**: Ensure PyTorch with CUDA is installed
+   ```bash
+   python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
+   ```
+3. **For Vulkan**: Ensure llama-cpp-python is installed with Vulkan support
+   ```bash
+   python -c "from llama_cpp import Llama; print('llama-cpp available')"
+   ```
 ## License
 This project is licensed under the GNU General Public License v3.0 - see the [LICENSE.md](LICENSE.md) file for details.
@@ -484,5 +698,10 @@ Contributions are welcome! Please feel free to submit a merge request.
 ## Acknowledgments
 - Built with [FastAPI](https://fastapi.tiangolo.com/)
- Powered by [HuggingFace Transformers](https://huggingface.co/docs/transformers/)
+- Powered by [HuggingFace Transformers](https://huggingface.co/docs/transformers/) (NVIDIA backend)
+- Powered by [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) with Vulkan support (AMD backend)
 - Inspired by the OpenAI API specification
+---
+**Note on AI.PROMPT**: This project was enhanced following instructions to add Vulkan support for AMD GPUs alongside the existing NVIDIA/CUDA support. The implementation uses llama-cpp-python for Vulkan/GGUF model support while maintaining full compatibility with the existing HuggingFace/Transformers backend for NVIDIA GPUs.
--- a/build.sh
+++ b/build.sh
+#!/bin/bash
+# Build script for CoderAI - Supports NVIDIA (CUDA) and Vulkan (AMD GPUs) backends
+# Usage: ./build.sh [nvidia|vulkan]
+# Default: nvidia
+set -e
+# Colors for output
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+BLUE='\033[0;34m'
+NC='\033[0m' # No Color
+# Determine backend
+BACKEND="${1:-nvidia}"
+BACKEND=$(echo "$BACKEND" | tr '[:upper:]' '[:lower:]')
+if [[ "$BACKEND" != "nvidia" && "$BACKEND" != "vulkan" ]]; then
+    echo -e "${RED}Error: Invalid backend '$BACKEND'${NC}"
+    echo "Usage: ./build.sh [nvidia|vulkan]"
+    echo "  nvidia  - Use PyTorch with CUDA for NVIDIA GPUs"
+    echo "  vulkan  - Use llama-cpp-python with Vulkan for AMD GPUs"
+    exit 1
+fi
+echo -e "${BLUE}========================================${NC}"
+echo -e "${BLUE}  CoderAI Build Script${NC}"
+echo -e "${BLUE}  Backend: ${GREEN}$BACKEND${NC}"
+echo -e "${BLUE}========================================${NC}"
+echo ""
+# Check Python version
+PYTHON_VERSION=$(python3 --version 2>&1 | grep -oP '\d+\.\d+' | head -1)
+REQUIRED_VERSION="3.8"
+if [ "$(printf '%s\n' "$REQUIRED_VERSION" "$PYTHON_VERSION" | sort -V | head -n1)" != "$REQUIRED_VERSION" ]; then
+    echo -e "${RED}Error: Python 3.8+ required, found $PYTHON_VERSION${NC}"
+    exit 1
+fi
+echo -e "${GREEN}✓ Python version: $PYTHON_VERSION${NC}"
+# Create virtual environment if it doesn't exist
+VENV_DIR="venv"
+if [ ! -d "$VENV_DIR" ]; then
+    echo -e "${YELLOW}Creating virtual environment...${NC}"
+    python3 -m venv "$VENV_DIR"
+fi
+# Activate virtual environment
+echo -e "${YELLOW}Activating virtual environment...${NC}"
+source "$VENV_DIR/bin/activate"
+# Upgrade pip
+echo -e "${YELLOW}Upgrading pip...${NC}"
+pip install --upgrade pip
+echo ""
+echo -e "${BLUE}Installing dependencies for $BACKEND backend...${NC}"
+echo ""
+if [ "$BACKEND" = "nvidia" ]; then
+    # NVIDIA/CUDA backend
+    echo -e "${YELLOW}Installing PyTorch with CUDA support...${NC}"
+    pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0"
+    echo -e "${YELLOW}Installing NVIDIA-specific requirements...${NC}"
+    pip install -r requirements-nvidia.txt
+    echo ""
+    echo -e "${GREEN}========================================${NC}"
+    echo -e "${GREEN}  NVIDIA/CUDA build complete!${NC}"
+    echo -e "${GREEN}========================================${NC}"
+    echo ""
+    echo "Usage:"
+    echo "  source venv/bin/activate"
+    echo "  python coderai --model <huggingface-model-name>"
+    echo ""
+    echo "Example:"
+    echo "  python coderai --model microsoft/DialoGPT-medium"
+    echo ""
+elif [ "$BACKEND" = "vulkan" ]; then
+    # Vulkan backend
+    echo -e "${YELLOW}Installing llama-cpp-python with Vulkan support...${NC}"
+    # Check for required Vulkan development libraries
+    if ! pkg-config --exists vulkan 2>/dev/null; then
+        echo -e "${YELLOW}Warning: Vulkan development libraries not found via pkg-config${NC}"
+        echo -e "${YELLOW}You may need to install Vulkan drivers and SDK:${NC}"
+        echo "  Debian/Ubuntu: sudo apt install libvulkan-dev vulkan-tools"
+        echo "  Fedora: sudo dnf install vulkan-loader-devel vulkan-tools"
+        echo "  Arch: sudo pacman -S vulkan-headers vulkan-icd-loader"
+        echo ""
+        echo -e "${YELLOW}Attempting installation anyway...${NC}"
+    fi
+    # Install llama-cpp-python with Vulkan support
+    # CMAKE_ARGS is used to enable Vulkan during compilation
+    CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python --no-cache-dir
+    echo -e "${YELLOW}Installing Vulkan-specific requirements...${NC}"
+    pip install -r requirements-vulkan.txt
+    echo ""
+    echo -e "${GREEN}========================================${NC}"
+    echo -e "${GREEN}  Vulkan build complete!${NC}"
+    echo -e "${GREEN}========================================${NC}"
+    echo ""
+    echo "Usage:"
+    echo "  source venv/bin/activate"
+    echo "  python coderai --model <path-to-gguf-model> --backend vulkan"
+    echo ""
+    echo "Example:"
+    echo "  python coderai --model ./phi-3-mini-4k-instruct-q4_k_m.gguf --backend vulkan"
+    echo ""
+    echo "Note: For Vulkan, you need to use GGUF format models."
+    echo "      Download from: https://huggingface.co/models?search=gguf"
+    echo ""
+fi
+# Create .backend file to track which backend was used
+echo "$BACKEND" > .backend
+echo -e "${GREEN}Build completed successfully!${NC}"
+echo ""
+echo "To activate the environment in the future, run:"
+echo "  source venv/bin/activate"
--- a/coderai
+++ b/coderai
 #!/usr/bin/env python3
 """
-OpenAI-compatible API server for HuggingFace models.
+OpenAI-compatible API server for HuggingFace models (NVIDIA) and GGUF models (Vulkan).
-Supports CUDA, ROCm GPU auto-detection, memory-aware model loading,
+Supports CUDA (NVIDIA) and Vulkan (AMD) GPU backends, memory-aware model loading,
-sequential offload (VRAM -> RAM -> Disk), streaming, and tool calling.
+streaming, and tool calling.
 """
 import argparse
@@ -14,228 +14,54 @@ import sys
 import time
 import uuid
 import warnings
+from abc import ABC, abstractmethod
 from contextlib import asynccontextmanager
 from typing import AsyncGenerator, Dict, List, Optional, Union
 import psutil
-import torch
 from fastapi import FastAPI, HTTPException, Request
 from fastapi.responses import StreamingResponse
 from pydantic import BaseModel, Field
-from transformers import (
-    AutoModelForCausalLM,
-    AutoTokenizer,
-    AutoConfig,
-    TextIteratorStreamer,
-    StoppingCriteria,
-    StoppingCriteriaList,
-    LogitsProcessor,
-    LogitsProcessorList,
-)
 from threading import Thread
 # =============================================================================
-# Flash Attention Detection
+# Backend Detection and Imports
 # =============================================================================
-def check_flash_attn_availability() -> bool:
+def detect_available_backends():
-    """Check if flash-attn is installed and available."""
+    """Detect which backends are available."""
+    backends = {'cpu': True}
+    # Check for PyTorch/CUDA
    try:
-        import flash_attn
+        import torch
-        return True
+        if torch.cuda.is_available():
+            backends['nvidia'] = True
    except ImportError:
-        return False
+        pass
-# =============================================================================
-# Logits Processor for Numerical Stability
-# =============================================================================
-class InvalidLogitsProcessor(LogitsProcessor):
-    """Replace NaN and Inf values in logits with finite values."""
-    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
+    # Check for llama-cpp-python (Vulkan)
-        """Replace invalid values in logits."""
+    try:
-        # Replace NaN with very negative number (near -inf but finite)
+        import llama_cpp
-        scores = torch.where(torch.isnan(scores), torch.tensor(-1e9, dtype=scores.dtype, device=scores.device), scores)
+        backends['vulkan'] = True
-        # Replace Inf with large finite number
+    except ImportError:
-        scores = torch.where(torch.isinf(scores), torch.tensor(1e9, dtype=scores.dtype, device=scores.device), scores)
+        pass
-        # Replace -Inf with very negative finite number
-        scores = torch.where(scores < -1e9, torch.tensor(-1e9, dtype=scores.dtype, device=scores.device), scores)
+    return backends
-        return scores
 # =============================================================================
-# Memory Detection and Model Sizing
+# Flash Attention Detection (for NVIDIA backend)
 # =============================================================================
-def get_available_vram() -> int:
+def check_flash_attn_availability() -> bool:
-    """Get available VRAM in bytes. Returns 0 if no GPU available."""
+    """Check if flash-attn is installed and available."""
-    if not torch.cuda.is_available():
-        return 0
-    try:
-        total_vram = 0
-        for i in range(torch.cuda.device_count()):
-            props = torch.cuda.get_device_properties(i)
-            total_vram += props.total_memory
-        return total_vram
-    except Exception as e:
-        print(f"Warning: Could not detect VRAM: {e}")
-        return 0
-def get_available_ram(manual_ram_gb: Optional[float] = None) -> int:
-    """
-    Get available system RAM in bytes.
-    Args:
-        manual_ram_gb: If specified, use this value in GB instead of auto-detection
-    Returns:
-        Available RAM in bytes
-    """
-    if manual_ram_gb is not None:
-        ram_bytes = int(manual_ram_gb * 1e9)
-        print(f"Using manually specified RAM: {manual_ram_gb} GB ({ram_bytes / 1e9:.2f} GB)")
-        return ram_bytes
-    try:
-        mem = psutil.virtual_memory()
-        print(f"Auto-detected RAM: {mem.available / 1e9:.2f} GB available")
-        return mem.available
-    except Exception as e:
-        print(f"Warning: Could not detect RAM: {e}")
-        return 0
-def estimate_model_size_from_config(model_name: str) -> Optional[int]:
-    """
-    Estimate model size in bytes from config.
-    Returns None if config cannot be loaded.
-    """
    try:
-        config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
+        import flash_attn
+        return True
-        # Get model parameters from config
+    except ImportError:
-        if hasattr(config, 'num_parameters'):
+        return False
-            num_params = config.num_parameters
-        elif hasattr(config, 'n_params'):
-            num_params = config.n_params
-        elif hasattr(config, 'num_hidden_layers') and hasattr(config, 'hidden_size'):
-            # Estimate based on transformer architecture
-            # Rough estimate: ~12 * num_layers * hidden_size^2 for standard transformers
-            layers = config.num_hidden_layers
-            hidden = config.hidden_size
-            vocab_size = getattr(config, 'vocab_size', 50000)
-            # Rough parameter count estimation
-            # Embedding: vocab_size * hidden_size
-            # Each layer: ~4 * hidden_size^2 (attn + FFN)
-            num_params = (vocab_size * hidden_size) + (layers * 4 * hidden_size * hidden_size)
-        else:
-            return None
-        # Assume float16 (2 bytes per parameter) for GPU loading
-        # This is the typical loading format
-        return num_params * 2
-    except Exception as e:
-        print(f"Warning: Could not estimate model size: {e}")
-        return None
-def calculate_safety_margin(memory_bytes: int) -> int:
-    """Apply safety margin to available memory (leave 10% headroom)."""
-    return int(memory_bytes * 0.9)
-def determine_offload_strategy(
-    model_name: str,
-    available_vram: int,
-    available_ram: int,
-    quantization_bits: Optional[int] = None
-) -> Dict[str, any]:
-    """
-    Determine the best offload strategy based on available memory.
-    Returns a dict with:
-    - device_map: str or dict for model loading
-    - offload_folder: Optional[str] for disk offload
-    - load_in_8bit: bool
-    - load_in_4bit: bool
-    - max_memory: Optional[dict]
-    """
-    # Estimate model size
-    estimated_size = estimate_model_size_from_config(model_name)
-    if estimated_size is None:
-        print("Could not estimate model size, using auto device_map")
-        return {
-            'device_map': 'auto',
-            'offload_folder': None,
-            'load_in_8bit': False,
-            'load_in_4bit': False,
-            'max_memory': None,
-        }
-    # Apply quantization factor if specified
-    if quantization_bits == 4:
-        estimated_size = estimated_size // 4  # 4-bit = 0.5 bytes per param
-    elif quantization_bits == 8:
-        estimated_size = estimated_size // 2  # 8-bit = 1 byte per param
-    # Add overhead for activations and gradients (roughly 20%)
-    required_memory = int(estimated_size * 1.2)
-    print(f"Estimated model size: {estimated_size / 1e9:.2f} GB")
-    print(f"Required memory (with overhead): {required_memory / 1e9:.2f} GB")
-    print(f"Available VRAM: {available_vram / 1e9:.2f} GB")
-    print(f"Available RAM: {available_ram / 1e9:.2f} GB")
-    safe_vram = calculate_safety_margin(available_vram)
-    safe_ram = calculate_safety_margin(available_ram)
-    strategy = {
-        'device_map': None,
-        'offload_folder': None,
-        'load_in_8bit': False,
-        'load_in_4bit': False,
-        'max_memory': None,
-    }
-    # Case 1: Model fits entirely in VRAM
-    if required_memory <= safe_vram:
-        print("Strategy: Loading fully to GPU")
-        strategy['device_map'] = 'cuda'
-        if torch.cuda.device_count() > 1:
-            strategy['device_map'] = 'auto'
-    # Case 2: Model fits in VRAM + RAM combined
-    elif required_memory <= (safe_vram + safe_ram):
-        print("Strategy: Using device_map='auto' for VRAM + RAM offload")
-        strategy['device_map'] = 'auto'
-        # Set max_memory to help accelerate distribute layers
-        if torch.cuda.is_available():
-            max_memory = {}
-            for i in range(torch.cuda.device_count()):
-                max_memory[i] = safe_vram // torch.cuda.device_count()
-            max_memory['cpu'] = safe_ram
-            strategy['max_memory'] = max_memory
-    # Case 3: Need disk offload
-    else:
-        print("Strategy: VRAM + RAM + Disk offload required")
-        strategy['device_map'] = 'auto'
-        if torch.cuda.is_available():
-            max_memory = {}
-            for i in range(torch.cuda.device_count()):
-                max_memory[i] = safe_vram // torch.cuda.device_count()
-            max_memory['cpu'] = safe_ram
-            strategy['max_memory'] = max_memory
-        # offload_folder will be set from command line argument
-    return strategy
 # =============================================================================
@@ -300,13 +126,13 @@ class ModelList(BaseModel):
 # =============================================================================
-# Tool Parsing and Function Calling
+# Tool Parsing
 # =============================================================================
 class ToolCallParser:
    """Parse model outputs to extract tool calls."""
-    def __init__(self, tokenizer):
+    def __init__(self, tokenizer=None):
        self.tokenizer = tokenizer
    def extract_tool_calls(self, text: str, available_tools: List[Tool]) -> Optional[List[Dict]]:
@@ -421,19 +247,59 @@ def format_tools_for_prompt(tools: List[Tool], messages: List[ChatMessage]) -> L
 # =============================================================================
-# Model Management
+# Abstract Model Backend
 # =============================================================================
-class ModelManager:
+class ModelBackend(ABC):
-    """Manages the loaded model and tokenizer."""
+    """Abstract base class for model backends."""
+    @abstractmethod
+    def load_model(self, model_name: str, **kwargs) -> None:
+        """Load the model."""
+        pass
+    @abstractmethod
+    def generate(self, prompt: str, max_tokens: Optional[int] = None, 
+                 temperature: float = 0.7, top_p: float = 1.0,
+                 stop: Optional[List[str]] = None) -> str:
+        """Generate text non-streaming."""
+        pass
+    @abstractmethod
+    def generate_stream(self, prompt: str, max_tokens: Optional[int] = None,
+                        temperature: float = 0.7, top_p: float = 1.0,
+                        stop: Optional[List[str]] = None) -> AsyncGenerator[str, None]:
+        """Generate text in streaming fashion."""
+        pass
+    @abstractmethod
+    def format_messages(self, messages: List[ChatMessage]) -> str:
+        """Format messages into a prompt string."""
+        pass
+    @abstractmethod
+    def get_model_name(self) -> str:
+        """Return the loaded model name."""
+        pass
+    @abstractmethod
+    def cleanup(self) -> None:
+        """Cleanup resources."""
+        pass
+# =============================================================================
+# NVIDIA/HuggingFace Backend
+# =============================================================================
+class NvidiaBackend(ModelBackend):
+    """Backend for NVIDIA GPUs using HuggingFace Transformers."""
    def __init__(self):
        self.model = None
        self.tokenizer = None
        self.model_name = None
        self.device = None
-        self.tool_parser = None
-        self.offload_folder = None
        self.use_flash_attn = False
        self.flash_attn_available = False
@@ -449,8 +315,9 @@ class ModelManager:
                print("Falling back to standard attention")
                self.use_flash_attn = False
-    def detect_device(self) -> str:
+    def _detect_device(self) -> str:
        """Auto-detect available GPU or fall back to CPU."""
+        import torch
        if torch.cuda.is_available():
            # Check for ROCm (HIP)
            if hasattr(torch.version, 'hip') and torch.version.hip is not None:
@@ -463,71 +330,64 @@ class ModelManager:
            print("No GPU detected, using CPU")
            return "cpu"
-    def load_model(
+    def _get_available_vram(self) -> int:
-        self,
+        """Get available VRAM in bytes. Returns 0 if no GPU available."""
-        model_name: str,
+        import torch
-        offload_dir: Optional[str] = None,
+        if not torch.cuda.is_available():
-        load_in_4bit: bool = False,
+            return 0
-        load_in_8bit: bool = False,
-        manual_ram_gb: Optional[float] = None,
-        flash_attn: bool = False,
-    ):
-        """
-        Load the model and tokenizer from HuggingFace with memory-aware offload.
-        Args:
-            model_name: HuggingFace model name or path
-            offload_dir: Directory for disk offload when model doesn't fit in VRAM+RAM
-            load_in_4bit: Use 4-bit quantization (requires bitsandbytes)
-            load_in_8bit: Use 8-bit quantization (requires bitsandbytes)
-            manual_ram_gb: Manually specify available RAM in GB (bypasses auto-detection)
-            flash_attn: Use Flash Attention 2 if available (requires flash-attn package)
-        """
-        print(f"Loading model: {model_name}")
-        self.use_flash_attn = flash_attn
-        self.check_flash_attn_support()
-        self.device = self.detect_device()
-        self.offload_folder = offload_dir
-        # Create offload directory if needed
+        try:
-        if offload_dir:
+            total_vram = 0
-            os.makedirs(offload_dir, exist_ok=True)
+            for i in range(torch.cuda.device_count()):
-            print(f"Disk offload directory: {offload_dir}")
+                props = torch.cuda.get_device_properties(i)
+                total_vram += props.total_memory
-        # Detect available memory
+            return total_vram
-        available_vram = get_available_vram()
+        except Exception as e:
-        available_ram = get_available_ram(manual_ram_gb)
+            print(f"Warning: Could not detect VRAM: {e}")
+            return 0
+    def _estimate_model_size(self, model_name: str) -> Optional[int]:
+        """Estimate model size in bytes from config."""
+        from transformers import AutoConfig
+        try:
+            config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
+            # Get model parameters from config
+            if hasattr(config, 'num_parameters'):
+                num_params = config.num_parameters
+            elif hasattr(config, 'n_params'):
+                num_params = config.n_params
+            elif hasattr(config, 'num_hidden_layers') and hasattr(config, 'hidden_size'):
+                layers = config.num_hidden_layers
+                hidden = config.hidden_size
+                vocab_size = getattr(config, 'vocab_size', 50000)
+                num_params = (vocab_size * hidden_size) + (layers * 4 * hidden * hidden)
+            else:
+                return None
+            # Assume float16 (2 bytes per parameter)
+            return num_params * 2
+        except Exception as e:
+            print(f"Warning: Could not estimate model size: {e}")
+            return None
+    def load_model(self, model_name: str, **kwargs) -> None:
+        """Load the model using HuggingFace Transformers."""
+        import torch
+        from transformers import AutoModelForCausalLM, AutoTokenizer
-        print(f"\nMemory Detection:")
+        offload_dir = kwargs.get('offload_dir')
-        print(f"  Available VRAM: {available_vram / 1e9:.2f} GB")
+        load_in_4bit = kwargs.get('load_in_4bit', False)
-        print(f"  Available RAM: {available_ram / 1e9:.2f} GB")
+        load_in_8bit = kwargs.get('load_in_8bit', False)
+        manual_ram_gb = kwargs.get('manual_ram_gb')
+        flash_attn = kwargs.get('flash_attn', False)
-        # Determine quantization bits
+        print(f"Loading HuggingFace model: {model_name}")
-        quantization_bits = None
-        if load_in_4bit:
-            quantization_bits = 4
-        elif load_in_8bit:
-            quantization_bits = 8
-        # Determine offload strategy
+        self.use_flash_attn = flash_attn
-        strategy = determine_offload_strategy(
+        self.check_flash_attn_support()
-            model_name,
-            available_vram,
-            available_ram,
-            quantization_bits
-        )
-        # Set offload folder if determined necessary
+        self.device = self._detect_device()
-        if strategy.get('offload_folder') is None and offload_dir:
-            estimated_size = estimate_model_size_from_config(model_name)
-            safe_vram = calculate_safety_margin(available_vram)
-            safe_ram = calculate_safety_margin(available_ram)
-            if estimated_size and estimated_size > (safe_vram + safe_ram):
-                strategy['offload_folder'] = offload_dir
-                print(f"Model will use disk offload at: {offload_dir}")
        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(
@@ -541,70 +401,48 @@ class ModelManager:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        # Prepare model loading arguments
-        load_kwargs = {
+        load_kwargs = {'trust_remote_code': True}
-            'trust_remote_code': True,
-        }
-        # Set dtype based on device and quantization
        if load_in_4bit or load_in_8bit:
-            # Check if bitsandbytes is available
            try:
                import bitsandbytes as bnb
                print(f"Using {4 if load_in_4bit else 8}-bit quantization")
                load_kwargs['load_in_4bit'] = load_in_4bit
                load_kwargs['load_in_8bit'] = load_in_8bit
-                load_kwargs['device_map'] = strategy['device_map'] or 'auto'
+                load_kwargs['device_map'] = 'auto'
            except ImportError:
                print("Warning: bitsandbytes not installed. Quantization disabled.")
-                print("Install with: pip install bitsandbytes")
                if self.device == "cuda":
                    load_kwargs['torch_dtype'] = torch.float16
                else:
                    load_kwargs['torch_dtype'] = torch.float32
-                load_kwargs['device_map'] = strategy['device_map'] or ('auto' if self.device == 'cuda' else None)
+                load_kwargs['device_map'] = 'auto' if self.device == 'cuda' else None
        else:
            if self.device == "cuda":
                load_kwargs['torch_dtype'] = torch.float16
            else:
                load_kwargs['torch_dtype'] = torch.float32
-            load_kwargs['device_map'] = strategy['device_map'] or ('auto' if self.device == 'cuda' else None)
+            load_kwargs['device_map'] = 'auto' if self.device == 'cuda' else None
-        # Add max_memory if specified
-        if strategy.get('max_memory'):
-            load_kwargs['max_memory'] = strategy['max_memory']
-        # Add offload_folder if specified
+        # Add offload folder if specified
-        if strategy.get('offload_folder'):
+        if offload_dir:
-            load_kwargs['offload_folder'] = strategy['offload_folder']
+            os.makedirs(offload_dir, exist_ok=True)
+            load_kwargs['offload_folder'] = offload_dir
+            print(f"Disk offload directory: {offload_dir}")
-        # Add Flash Attention 2 configuration if enabled and available
+        # Add Flash Attention 2 if enabled
        if self.use_flash_attn and self.flash_attn_available:
            load_kwargs['attn_implementation'] = "flash_attention_2"
-            print("\nUsing Flash Attention 2 for attention implementation")
+            print("Using Flash Attention 2")
-        print(f"\nModel loading arguments:")
-        for key, value in load_kwargs.items():
-            print(f"  {key}: {value}")
        # Load model
-        self.model = AutoModelForCausalLM.from_pretrained(
+        self.model = AutoModelForCausalLM.from_pretrained(model_name, **load_kwargs)
-            model_name,
-            **load_kwargs
-        )
-        # Handle CPU case where device_map is None
        if self.device == "cpu" and load_kwargs.get('device_map') is None:
            self.model = self.model.to(self.device)
        self.model.eval()
        self.model_name = model_name
-        self.tool_parser = ToolCallParser(self.tokenizer)
-        # Print model device placement
-        if hasattr(self.model, 'hf_device_map'):
-            print(f"\nDevice map:")
-            for layer, device in self.model.hf_device_map.items():
-                print(f"  {layer}: {device}")
        print(f"\nModel loaded successfully")
        print(f"Model device: {next(self.model.parameters()).device}")
@@ -632,41 +470,74 @@ class ModelManager:
        formatted.append("Assistant:")
        return "\n\n".join(formatted)
-    def _validate_generation_params(self, temperature: float, top_p: float) -> tuple:
+    def _validate_params(self, temperature: float, top_p: float) -> tuple:
-        """Validate and clamp generation parameters for numerical stability."""
+        """Validate generation parameters."""
-        # Clamp temperature to avoid numerical issues
-        # Temperature must be > 0 for sampling, but very small values can cause issues
        if temperature <= 0:
            temperature = 1.0
            do_sample = False
        else:
            temperature = max(0.01, min(temperature, 2.0))
            do_sample = True
-        # Clamp top_p
        top_p = max(0.0, min(top_p, 1.0))
        return temperature, top_p, do_sample
-    def generate_stream(
+    def generate(self, prompt: str, max_tokens: Optional[int] = None,
-        self,
+                 temperature: float = 0.7, top_p: float = 1.0,
-        prompt: str,
+                 stop: Optional[List[str]] = None) -> str:
-        max_tokens: Optional[int] = None,
+        """Generate text non-streaming."""
-        temperature: float = 0.7,
+        import torch
-        top_p: float = 1.0,
+        from transformers import LogitsProcessor, LogitsProcessorList
-        stop: Optional[List[str]] = None,
-    ) -> AsyncGenerator[str, None]:
+        class InvalidLogitsProcessor(LogitsProcessor):
-        """Generate text in streaming fashion."""
+            def __call__(self, input_ids, scores):
+                scores = torch.where(torch.isnan(scores), torch.tensor(-1e9, dtype=scores.dtype, device=scores.device), scores)
+                scores = torch.where(torch.isinf(scores), torch.tensor(1e9, dtype=scores.dtype, device=scores.device), scores)
+                return scores
        inputs = self.tokenizer(prompt, return_tensors="pt", padding=True)
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
-        input_length = inputs["input_ids"].shape[1]
+        if max_tokens is None:
+            max_tokens = 512
+        temperature, top_p, do_sample = self._validate_params(temperature, top_p)
+        with torch.no_grad():
+            outputs = self.model.generate(
+                input_ids=inputs["input_ids"],
+                attention_mask=inputs["attention_mask"],
+                max_new_tokens=max_tokens,
+                temperature=temperature if do_sample else None,
+                top_p=top_p if do_sample else None,
+                do_sample=do_sample,
+                pad_token_id=self.tokenizer.pad_token_id,
+                eos_token_id=self.tokenizer.eos_token_id,
+                logits_processor=LogitsProcessorList([InvalidLogitsProcessor()]),
+            )
+        generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
+        return self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
+    async def generate_stream(self, prompt: str, max_tokens: Optional[int] = None,
+                              temperature: float = 0.7, top_p: float = 1.0,
+                              stop: Optional[List[str]] = None) -> AsyncGenerator[str, None]:
+        """Generate text in streaming fashion."""
+        import torch
+        from transformers import TextIteratorStreamer, LogitsProcessor, LogitsProcessorList, StoppingCriteria, StoppingCriteriaList
+        class InvalidLogitsProcessor(LogitsProcessor):
+            def __call__(self, input_ids, scores):
+                scores = torch.where(torch.isnan(scores), torch.tensor(-1e9, dtype=scores.dtype, device=scores.device), scores)
+                scores = torch.where(torch.isinf(scores), torch.tensor(1e9, dtype=scores.dtype, device=scores.device), scores)
+                return scores
+        inputs = self.tokenizer(prompt, return_tensors="pt", padding=True)
+        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
        if max_tokens is None:
            max_tokens = 512
-        # Validate parameters
+        temperature, top_p, do_sample = self._validate_params(temperature, top_p)
-        temperature, top_p, do_sample = self._validate_generation_params(temperature, top_p)
        streamer = TextIteratorStreamer(
            self.tokenizer,
@@ -684,13 +555,9 @@ class ModelManager:
            "streamer": streamer,
            "pad_token_id": self.tokenizer.pad_token_id,
            "eos_token_id": self.tokenizer.eos_token_id,
+            "logits_processor": LogitsProcessorList([InvalidLogitsProcessor()]),
        }
-        # Add logits processor to handle NaN/Inf values
-        generation_kwargs["logits_processor"] = LogitsProcessorList([
-            InvalidLogitsProcessor()
-        ])
        # Handle stop sequences
        if stop:
            class StopOnSequence(StoppingCriteria):
@@ -706,106 +573,279 @@ class ModelManager:
                StopOnSequence(stop, self.tokenizer)
            ])
-        # Run generation in a separate thread with error handling
+        # Run generation in a separate thread
-        generated_text = ""
+        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
-        try:
+        thread.start()
-            thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
-            thread.start()
+        for text in streamer:
+            yield text
-            for text in streamer:
-                generated_text += text
+        thread.join()
-                yield text
+    def get_model_name(self) -> str:
-            thread.join()
+        return self.model_name or "unknown"
-        except RuntimeError as e:
-            if "probability tensor contains" in str(e):
+    def cleanup(self) -> None:
-                print(f"Warning: Numerical error during generation: {e}")
+        import torch
-                print("This may be due to temperature=0 or numerical instability.")
+        if self.model is not None:
-                print("Trying again with greedy decoding...")
+            del self.model
-                # Fallback to greedy decoding
+            del self.tokenizer
-                generation_kwargs["do_sample"] = False
+            self.model = None
-                generation_kwargs["temperature"] = None
+            self.tokenizer = None
-                generation_kwargs["top_p"] = None
+            if torch.cuda.is_available():
-                thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
+                torch.cuda.empty_cache()
-                thread.start()
-                for text in streamer:
-                    generated_text += text
+# =============================================================================
-                    yield text
+# Vulkan Backend (llama-cpp-python)
-                thread.join()
+# =============================================================================
-            else:
+class VulkanBackend(ModelBackend):
+    """Backend for Vulkan (AMD GPUs) using llama-cpp-python with GGUF models."""
+    def __init__(self):
+        self.model = None
+        self.model_name = None
+        self.n_gpu_layers = -1  # Offload all layers to GPU by default
+        self.n_ctx = 2048
+        self.verbose = True
+    def load_model(self, model_name: str, **kwargs) -> None:
+        """Load a GGUF model using llama-cpp-python."""
+        from llama_cpp import Llama
+        # model_name should be a path to a .gguf file or a HuggingFace model ID
+        # that will be resolved to a GGUF file
+        n_gpu_layers = kwargs.get('n_gpu_layers', -1)
+        n_ctx = kwargs.get('n_ctx', 2048)
+        verbose = kwargs.get('verbose', True)
+        # Check if model_name is a local file
+        if os.path.isfile(model_name):
+            model_path = model_name
+            print(f"Loading local GGUF model: {model_path}")
+        else:
+            # Try to download from HuggingFace Hub
+            print(f"Attempting to download GGUF model: {model_name}")
+            try:
+                from huggingface_hub import hf_hub_download, list_repo_files
+                # Parse model name (format: "org/model" or "org/model/filename.gguf")
+                parts = model_name.split('/')
+                if len(parts) >= 2:
+                    repo_id = f"{parts[0]}/{parts[1]}"
+                    # If specific file provided
+                    if len(parts) >= 3 and parts[-1].endswith('.gguf'):
+                        filename = '/'.join(parts[2:])
+                    else:
+                        # Find GGUF files in the repo
+                        files = list_repo_files(repo_id)
+                        gguf_files = [f for f in files if f.endswith('.gguf')]
+                        if not gguf_files:
+                            raise ValueError(f"No GGUF files found in {repo_id}")
+                        # Prefer Q4_K_M quantized models for good balance
+                        preferred = [f for f in gguf_files if 'Q4_K_M' in f or 'q4_k_m' in f.lower()]
+                        if preferred:
+                            filename = preferred[0]
+                        else:
+                            filename = gguf_files[0]
+                        print(f"Selected GGUF file: {filename}")
+                    model_path = hf_hub_download(repo_id=repo_id, filename=filename)
+                    print(f"Downloaded to: {model_path}")
+                else:
+                    raise ValueError(f"Invalid model name format: {model_name}")
+            except Exception as e:
+                print(f"Error downloading model: {e}")
+                print("Please provide a local path to a .gguf file")
                raise
+        print(f"Loading GGUF model with Vulkan support...")
+        print(f"  Model path: {model_path}")
+        print(f"  GPU layers: {n_gpu_layers} (-1 = all layers)")
+        print(f"  Context size: {n_ctx}")
+        try:
+            self.model = Llama(
+                model_path=model_path,
+                n_gpu_layers=n_gpu_layers,
+                n_ctx=n_ctx,
+                verbose=verbose,
+            )
+            self.model_name = model_name
+            print("\nModel loaded successfully with Vulkan!")
+        except Exception as e:
+            print(f"Error loading model with Vulkan: {e}")
+            print("Make sure Vulkan drivers are installed:")
+            print("  Debian/Ubuntu: sudo apt install libvulkan-dev vulkan-tools")
+            print("  Fedora: sudo dnf install vulkan-loader-devel vulkan-tools")
+            raise
-    def generate(
+    def format_messages(self, messages: List[ChatMessage]) -> str:
-        self,
+        """Format messages into a prompt string suitable for chat models."""
-        prompt: str,
+        formatted = []
-        max_tokens: Optional[int] = None,
-        temperature: float = 0.7,
+        for msg in messages:
-        top_p: float = 1.0,
+            if msg.role == "system":
-        stop: Optional[List[str]] = None,
+                formatted.append(f"<|system|>\n{msg.content}")
-    ) -> str:
+            elif msg.role == "user":
-        """Generate text non-streaming."""
+                formatted.append(f"<|user|>\n{msg.content}")
-        inputs = self.tokenizer(prompt, return_tensors="pt", padding=True)
+            elif msg.role == "assistant":
-        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
+                content = msg.content or ""
+                formatted.append(f"<|assistant|>\n{content}")
+        formatted.append("<|assistant|>\n")
+        return "\n".join(formatted)
+    def generate(self, prompt: str, max_tokens: Optional[int] = None,
+                 temperature: float = 0.7, top_p: float = 1.0,
+                 stop: Optional[List[str]] = None) -> str:
+        """Generate text non-streaming using llama-cpp."""
        if max_tokens is None:
            max_tokens = 512
-        # Validate parameters
+        output = self.model(
-        temperature, top_p, do_sample = self._validate_generation_params(temperature, top_p)
+            prompt,
+            max_tokens=max_tokens,
+            temperature=temperature,
+            top_p=top_p,
+            stop=stop or [],
+        )
-        try:
+        return output["choices"][0]["text"]
-            with torch.no_grad():
-                outputs = self.model.generate(
-                    input_ids=inputs["input_ids"],
-                    attention_mask=inputs["attention_mask"],
-                    max_new_tokens=max_tokens,
-                    temperature=temperature if do_sample else None,
-                    top_p=top_p if do_sample else None,
-                    do_sample=do_sample,
-                    pad_token_id=self.tokenizer.pad_token_id,
-                    eos_token_id=self.tokenizer.eos_token_id,
-                    stopping_criteria=self._create_stopping_criteria(stop) if stop else None,
-                    logits_processor=LogitsProcessorList([InvalidLogitsProcessor()]),
-                )
-            generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
-            return self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
-        except RuntimeError as e:
-            if "probability tensor contains" in str(e):
-                print(f"Warning: Numerical error during generation: {e}")
-                print("Retrying with greedy decoding...")
-                # Fallback to greedy decoding
-                with torch.no_grad():
-                    outputs = self.model.generate(
-                        input_ids=inputs["input_ids"],
-                        attention_mask=inputs["attention_mask"],
-                        max_new_tokens=max_tokens,
-                        do_sample=False,
-                        pad_token_id=self.tokenizer.pad_token_id,
-                        eos_token_id=self.tokenizer.eos_token_id,
-                        stopping_criteria=self._create_stopping_criteria(stop) if stop else None,
-                        logits_processor=LogitsProcessorList([InvalidLogitsProcessor()]),
-                    )
-                generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
-                return self.tokenizer.decode(generated_tokens, skip_special_tokens=True)
-            else:
-                raise
-    def _create_stopping_criteria(self, stop_sequences):
+    async def generate_stream(self, prompt: str, max_tokens: Optional[int] = None,
-        """Create stopping criteria for stop sequences."""
+                              temperature: float = 0.7, top_p: float = 1.0,
-        if not stop_sequences:
+                              stop: Optional[List[str]] = None) -> AsyncGenerator[str, None]:
-            return None
+        """Generate text in streaming fashion using llama-cpp."""
+        if max_tokens is None:
+            max_tokens = 512
-        class StopOnSequence(StoppingCriteria):
+        stream = self.model(
-            def __init__(self, stop_sequences, tokenizer):
+            prompt,
-                self.stop_sequences = stop_sequences
+            max_tokens=max_tokens,
-                self.tokenizer = tokenizer
+            temperature=temperature,
+            top_p=top_p,
-            def __call__(self, input_ids, scores, **kwargs):
+            stop=stop or [],
-                decoded = self.tokenizer.decode(input_ids[0][-20:], skip_special_tokens=True)
+            stream=True,
-                return any(seq in decoded for seq in self.stop_sequences)
+        )
+        for chunk in stream:
+            text = chunk["choices"][0].get("text", "")
+            if text:
+                yield text
+    def get_model_name(self) -> str:
+        return self.model_name or "unknown"
+    def cleanup(self) -> None:
+        if self.model is not None:
+            del self.model
+            self.model = None
+# =============================================================================
+# Model Manager
+# =============================================================================
+class ModelManager:
+    """Manages the loaded model and tokenizer."""
+    def __init__(self):
+        self.backend: Optional[ModelBackend] = None
+        self.backend_type: Optional[str] = None
+        self.tool_parser = ToolCallParser()
+    def load_model(self, model_name: str, backend_type: str = "auto", **kwargs):
+        """
+        Load the model with the specified backend.
+        Args:
+            model_name: Model name or path
+            backend_type: 'nvidia', 'vulkan', or 'auto' to detect
+            **kwargs: Additional arguments for the specific backend
+        """
+        available = detect_available_backends()
+        # Determine backend
+        if backend_type == "auto":
+            if available.get('nvidia'):
+                backend_type = "nvidia"
+                print("Auto-detected NVIDIA backend")
+            elif available.get('vulkan'):
+                backend_type = "vulkan"
+                print("Auto-detected Vulkan backend")
+            else:
+                print("No GPU backend detected. For NVIDIA, install PyTorch with CUDA.")
+                print("For Vulkan, install llama-cpp-python with Vulkan support.")
+                raise RuntimeError("No suitable backend found")
+        self.backend_type = backend_type
+        # Create appropriate backend
+        if backend_type == "nvidia":
+            if not available.get('nvidia'):
+                raise RuntimeError("NVIDIA backend requested but PyTorch/CUDA not available")
+            self.backend = NvidiaBackend()
+        elif backend_type == "vulkan":
+            if not available.get('vulkan'):
+                raise RuntimeError("Vulkan backend requested but llama-cpp-python not available")
+            self.backend = VulkanBackend()
+        else:
+            raise ValueError(f"Unknown backend: {backend_type}")
+        # Load the model
+        self.backend.load_model(model_name, **kwargs)
+        self.tool_parser = ToolCallParser()
-        return StoppingCriteriaList([StopOnSequence(stop_sequences, self.tokenizer)])
+    def format_messages(self, messages: List[ChatMessage]) -> str:
+        """Format messages into a prompt string."""
+        if self.backend is None:
+            raise RuntimeError("No model loaded")
+        return self.backend.format_messages(messages)
+    def generate(self, prompt: str, max_tokens: Optional[int] = None,
+                 temperature: float = 0.7, top_p: float = 1.0,
+                 stop: Optional[List[str]] = None) -> str:
+        """Generate text non-streaming."""
+        if self.backend is None:
+            raise RuntimeError("No model loaded")
+        return self.backend.generate(prompt, max_tokens, temperature, top_p, stop)
+    async def generate_stream(self, prompt: str, max_tokens: Optional[int] = None,
+                              temperature: float = 0.7, top_p: float = 1.0,
+                              stop: Optional[List[str]] = None) -> AsyncGenerator[str, None]:
+        """Generate text in streaming fashion."""
+        if self.backend is None:
+            raise RuntimeError("No model loaded")
+        async for chunk in self.backend.generate_stream(prompt, max_tokens, temperature, top_p, stop):
+            yield chunk
+    @property
+    def model_name(self) -> str:
+        if self.backend is None:
+            return "unknown"
+        return self.backend.get_model_name()
+    @property
+    def model(self):
+        if self.backend is None:
+            return None
+        return self.backend
+    @property
+    def tokenizer(self):
+        # Only NVIDIA backend has a tokenizer
+        if isinstance(self.backend, NvidiaBackend):
+            return self.backend.tokenizer
+        return None
+    def cleanup(self):
+        if self.backend is not None:
+            self.backend.cleanup()
+            self.backend = None
 # Global model manager
@@ -822,16 +862,13 @@ async def lifespan(app: FastAPI):
    # Startup
    yield
    # Shutdown
-    if model_manager.model is not None:
+    model_manager.cleanup()
-        del model_manager.model
-        del model_manager.tokenizer
-        torch.cuda.empty_cache() if torch.cuda.is_available() else None
 app = FastAPI(
    title="OpenAI-Compatible API",
-    description="OpenAI-compatible API for HuggingFace models with memory-aware loading",
+    description="OpenAI-compatible API supporting NVIDIA (CUDA) and Vulkan backends",
-    version="1.0.0",
+    version="2.0.0",
    lifespan=lifespan,
 )
@@ -850,7 +887,7 @@ async def list_models():
 @app.post("/v1/chat/completions")
 async def chat_completions(request: ChatCompletionRequest):
    """Chat completions endpoint with streaming and tool support."""
-    if model_manager.model is None:
+    if model_manager.backend is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    # Format messages with tools if provided
@@ -910,7 +947,7 @@ async def stream_chat_response(
    generated_text = ""
    try:
-        for chunk in model_manager.generate_stream(
+        async for chunk in model_manager.generate_stream(
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=temperature,
@@ -936,7 +973,6 @@ async def stream_chat_response(
        if tools:
            tool_calls = model_manager.tool_parser.extract_tool_calls(generated_text, tools)
            if tool_calls:
-                # Send tool calls as final delta
                data = {
                    "id": completion_id,
                    "object": "chat.completion.chunk",
@@ -957,7 +993,6 @@ async def stream_chat_response(
        yield "data: [DONE]\n\n"
    except Exception as e:
        print(f"Error during streaming generation: {e}")
-        # Send error event
        data = {
            "id": completion_id,
            "object": "chat.completion.chunk",
@@ -1010,6 +1045,15 @@ async def generate_chat_response(
                response_message["content"] = None
                finish_reason = "tool_calls"
+        # Calculate token counts if tokenizer available
+        if model_manager.tokenizer:
+            prompt_tokens = len(model_manager.tokenizer.encode(prompt))
+            completion_tokens = len(model_manager.tokenizer.encode(generated_text))
+        else:
+            # Rough estimate for Vulkan backend
+            prompt_tokens = len(prompt.split())
+            completion_tokens = len(generated_text.split())
        return {
            "id": completion_id,
            "object": "chat.completion",
@@ -1021,9 +1065,9 @@ async def generate_chat_response(
                "finish_reason": finish_reason,
            }],
            "usage": {
-                "prompt_tokens": len(model_manager.tokenizer.encode(prompt)),
+                "prompt_tokens": prompt_tokens,
-                "completion_tokens": len(model_manager.tokenizer.encode(generated_text)),
+                "completion_tokens": completion_tokens,
-                "total_tokens": len(model_manager.tokenizer.encode(prompt)) + len(model_manager.tokenizer.encode(generated_text)),
+                "total_tokens": prompt_tokens + completion_tokens,
            },
        }
    except Exception as e:
@@ -1034,7 +1078,7 @@ async def generate_chat_response(
 @app.post("/v1/completions")
 async def completions(request: CompletionRequest):
    """Text completions endpoint."""
-    if model_manager.model is None:
+    if model_manager.backend is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    prompts = request.prompt if isinstance(request.prompt, list) else [request.prompt]
@@ -1078,7 +1122,7 @@ async def stream_completion_response(
    created = int(time.time())
    try:
-        for chunk in model_manager.generate_stream(
+        async for chunk in model_manager.generate_stream(
            prompt=prompt,
            max_tokens=max_tokens,
            temperature=temperature,
@@ -1128,6 +1172,14 @@ async def generate_completion_response(
            stop=stop,
        )
+        # Calculate token counts if tokenizer available
+        if model_manager.tokenizer:
+            prompt_tokens = len(model_manager.tokenizer.encode(prompt))
+            completion_tokens = len(model_manager.tokenizer.encode(generated_text))
+        else:
+            prompt_tokens = len(prompt.split())
+            completion_tokens = len(generated_text.split())
        return {
            "id": completion_id,
            "object": "text_completion",
@@ -1140,9 +1192,9 @@ async def generate_completion_response(
                "finish_reason": "stop",
            }],
            "usage": {
-                "prompt_tokens": len(model_manager.tokenizer.encode(prompt)),
+                "prompt_tokens": prompt_tokens,
-                "completion_tokens": len(model_manager.tokenizer.encode(generated_text)),
+                "completion_tokens": completion_tokens,
-                "total_tokens": len(model_manager.tokenizer.encode(prompt)) + len(model_manager.tokenizer.encode(generated_text)),
+                "total_tokens": prompt_tokens + completion_tokens,
            },
        }
    except Exception as e:
@@ -1157,13 +1209,20 @@ async def generate_completion_response(
 def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(
-        description="OpenAI-compatible API server with memory-aware model loading"
+        description="OpenAI-compatible API server supporting NVIDIA (CUDA) and Vulkan backends"
    )
    parser.add_argument(
        "--model",
        type=str,
        default=None,
-        help="HuggingFace model name or path",
+        help="Model name or path. For NVIDIA: HuggingFace model. For Vulkan: GGUF file path or HF repo",
+    )
+    parser.add_argument(
+        "--backend",
+        type=str,
+        choices=["auto", "nvidia", "vulkan"],
+        default="auto",
+        help="Backend to use: auto (detect), nvidia (CUDA), or vulkan (AMD GPUs)",
    )
    parser.add_argument(
        "--host",
@@ -1181,68 +1240,116 @@ def parse_args():
        "--offload-dir",
        type=str,
        default="./offload",
-        help="Directory for disk offload when model doesn't fit in VRAM+RAM (default: ./offload)",
+        help="Directory for disk offload (NVIDIA backend only, default: ./offload)",
    )
    parser.add_argument(
        "--load-in-4bit",
        action="store_true",
-        help="Load model in 4-bit precision (requires bitsandbytes)",
+        help="Load model in 4-bit precision (NVIDIA backend only, requires bitsandbytes)",
    )
    parser.add_argument(
        "--load-in-8bit",
        action="store_true",
-        help="Load model in 8-bit precision (requires bitsandbytes)",
+        help="Load model in 8-bit precision (NVIDIA backend only, requires bitsandbytes)",
    )
    parser.add_argument(
        "--ram",
        type=float,
        default=None,
-        help="Manually specify available RAM in GB (bypasses auto-detection)",
+        help="Manually specify available RAM in GB (NVIDIA backend only)",
    )
    parser.add_argument(
        "--flash-attn",
        action="store_true",
-        help="Use Flash Attention 2 for faster inference (requires flash-attn package and compatible GPU)",
+        help="Use Flash Attention 2 (NVIDIA backend only, requires flash-attn package)",
+    )
+    parser.add_argument(
+        "--n-gpu-layers",
+        type=int,
+        default=-1,
+        help="Number of layers to offload to GPU (Vulkan backend only, default: -1 = all layers)",
+    )
+    parser.add_argument(
+        "--n-ctx",
+        type=int,
+        default=2048,
+        help="Context window size (Vulkan backend only, default: 2048)",
    )
    return parser.parse_args()
 def main():
    """Main entry point."""
-    import procname
+    # Optional: set process name if procname is available
-    procname.setprocname("coderai")
+    try:
+        import procname
+        procname.setprocname("coderai")
+    except ImportError:
+        pass
    args = parse_args()
    # Get model name from args or prompt interactively
    model_name = args.model
    if model_name is None:
-        print("No model specified. Please enter a HuggingFace model name.")
+        print("No model specified. Please enter a model name.")
-        print("Examples:")
+        print("")
+        print("For NVIDIA backend (HuggingFace models):")
        print("  - microsoft/DialoGPT-medium")
-        print("  - facebook/blenderbot-400M-distill")
        print("  - meta-llama/Llama-2-7b-chat-hf (requires auth)")
        print("  - TinyLlama/TinyLlama-1.1B-Chat-v1.0")
        print("")
+        print("For Vulkan backend (GGUF models):")
+        print("  - Local path: ./phi-3-mini-4k-instruct-q4_k_m.gguf")
+        print("  - HuggingFace: microsoft/Phi-3-mini-4k-instruct-gguf")
+        print("")
        model_name = input("Enter model name: ").strip()
        if not model_name:
            print("Error: Model name is required")
            sys.exit(1)
-    # Load the model with memory-aware offload
+    # Detect available backends
-    model_manager.load_model(
+    available = detect_available_backends()
-        model_name=model_name,
+    print("\nAvailable backends:")
-        offload_dir=args.offload_dir,
+    for name, available_flag in available.items():
-        load_in_4bit=args.load_in_4bit,
+        status = "✓" if available_flag else "✗"
-        load_in_8bit=args.load_in_8bit,
+        print(f"  [{status}] {name}")
-        manual_ram_gb=args.ram,
+    print("")
-        flash_attn=getattr(args, 'flash_attn', False),
-    )
+    # Load the model
+    load_kwargs = {
+        'offload_dir': args.offload_dir,
+        'load_in_4bit': args.load_in_4bit,
+        'load_in_8bit': args.load_in_8bit,
+        'manual_ram_gb': args.ram,
+        'flash_attn': args.flash_attn,
+        'n_gpu_layers': args.n_gpu_layers,
+        'n_ctx': args.n_ctx,
+    }
+    try:
+        model_manager.load_model(
+            model_name=model_name,
+            backend_type=args.backend,
+            **load_kwargs
+        )
+    except Exception as e:
+        print(f"\nError loading model: {e}")
+        print("\nTroubleshooting:")
+        if args.backend == "vulkan":
+            print("  - For Vulkan, ensure you have Vulkan drivers installed")
+            print("  - Make sure you're using a GGUF format model")
+            print("  - Run build.sh with 'vulkan' argument first")
+        else:
+            print("  - For NVIDIA, ensure PyTorch with CUDA is installed")
+            print("  - Run build.sh with 'nvidia' argument first")
+        sys.exit(1)
    # Start the server
    import uvicorn
    print(f"\nStarting server on http://{args.host}:{args.port}")
    print(f"API documentation available at http://{args.host}:{args.port}/docs")
+    print(f"Using backend: {model_manager.backend_type}")
    uvicorn.run(app, host=args.host, port=args.port)

--- a/requirements-nvidia.txt
+++ b/requirements-nvidia.txt
+# FastAPI and server dependencies
+fastapi>=0.104.0
+uvicorn[standard]>=0.24.0
+pydantic>=2.5.0
+# ML dependencies (transformers-based for NVIDIA/CUDA)
+transformers>=4.35.0
+accelerate>=0.24.0
+# System resource detection
+psutil>=5.9.0
+procname>=0.3.0  # optional - for setting process name
+# Optional: for better performance with NVIDIA GPUs
+bitsandbytes>=0.41.0
+sentencepiece>=0.1.99
+protobuf>=3.20.0
+# Optional: Flash Attention 2 for faster inference on supported NVIDIA GPUs
+# Requires specific CUDA versions and may need manual installation
+# Install with: pip install flash-attn --no-build-isolation
+# flash-attn>=2.5.0
--- a/requirements-vulkan.txt
+++ b/requirements-vulkan.txt
+# FastAPI and server dependencies
+fastapi>=0.104.0
+uvicorn[standard]>=0.24.0
+pydantic>=2.5.0
+# llama-cpp-python is installed by build.sh with Vulkan support
+# CMAKE_ARGS="-DGGML_VULKAN=ON" pip install llama-cpp-python --no-cache-dir
+# System resource detection
+psutil>=5.9.0
+procname>=0.3.0  # optional - for setting process name
+# HuggingFace Hub for downloading GGUF models
+huggingface-hub>=0.19.0
+# No PyTorch needed for Vulkan backend - llama-cpp handles everything