Fix NaN/inf probability tensor error during generation

- Add InvalidLogitsProcessor to replace NaN and Inf values with finite numbers
- Add _validate_generation_params() to clamp temperature and top_p to valid ranges
- Add try-except blocks with fallback to greedy decoding on numerical errors
- Add error handling in streaming responses to prevent crashes
- Fix temperature=0 handling to use greedy decoding instead of sampling
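The bullet points above can be illustrated with a framework-free sketch of the NaN/Inf replacement rule. This is illustrative only (`sanitize_logits`, `NEG_FILL`, and `POS_FILL` are hypothetical names, and a plain list stands in for a logits tensor); the real `InvalidLogitsProcessor` would subclass transformers' `LogitsProcessor` and could use `torch.nan_to_num` to apply the same rule to tensors:

```python
import math

# Illustrative replacement rule behind the fix: map non-finite logits to
# finite values so softmax can no longer yield a NaN/Inf probability tensor.
NEG_FILL = -1e9  # large negative logit: token becomes effectively unsampleable
POS_FILL = 1e9   # finite stand-in for +inf

def sanitize_logits(logits):
    cleaned = []
    for x in logits:
        if math.isnan(x):
            cleaned.append(NEG_FILL)  # NaN carries no information: mask the token
        elif math.isinf(x):
            cleaned.append(POS_FILL if x > 0 else NEG_FILL)  # clamp +/-inf
        else:
            cleaned.append(x)
    return cleaned
```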
parent 087ba9e1
@@ -21,7 +21,7 @@ An OpenAI-compatible API server for HuggingFace models with intelligent memory m
- Python 3.8+
- For NVIDIA GPUs: CUDA toolkit (11.8+ recommended)
- For AMD GPUs: ROCm (5.6+ recommended, 6.0+ preferred)
- For CPU-only: No additional requirements

### Basic Installation

@@ -43,36 +43,44 @@ pip install -r requirements.txt
PyTorch installation varies by platform. Uncomment the appropriate section in [`requirements.txt`](requirements.txt) or install manually:
> **⚠️ WARNING: Shell Redirection Issue**
> When using `>=` in pip commands, always use **quotes** around the package specifier!
> Without quotes, the shell interprets `>` as output redirection.
>
> ❌ Wrong: `pip install torch>=2.0.0` (creates file named "=2.0.0")
> ✅ Correct: `pip install "torch>=2.0.0"` (with quotes)
> ✅ Also correct: `pip install torch==2.0.0` (exact version, no >=)
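The failure mode is easy to reproduce safely with `echo` standing in for `pip` (a throwaway demo in a temp directory; no packages are touched):

```shell
# Reproduce the pitfall safely: echo stands in for pip
cd "$(mktemp -d)"
echo torch>=2.0.0    # shell runs `echo torch` and redirects output into a file named '=2.0.0'
ls -l                # a stray '=2.0.0' file now exists; no version constraint was applied
rm -f ./=2.0.0       # clean-up (also the fix if an unquoted pip command left one behind)
```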
#### NVIDIA (CUDA)
```bash
# For CUDA 11.8
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/cu118

# For CUDA 12.1
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/cu121

# For CUDA 12.4 (latest)
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0"
```
#### AMD (ROCm)
```bash
# For ROCm 6.0 (recommended for newer AMD GPUs)
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/rocm6.0

# For ROCm 5.6 (for older AMD GPUs)
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/rocm5.6
```
> **Note**: ROCm 5.4.2 is deprecated. Use ROCm 5.6 or 6.0 for better compatibility.
> Check available versions at: https://pytorch.org/get-started/locally/
#### CPU Only
```bash
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/cpu
```
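After installing, a quick way to confirm which backend your PyTorch wheel actually targets is to inspect it from Python. This is a small helper sketch, not part of this project's code (`torch_backend_summary` is a hypothetical name):

```python
def torch_backend_summary():
    """Report which PyTorch backend is active, without crashing if torch is absent."""
    try:
        import torch
    except ImportError:
        return {"installed": False}
    return {
        "installed": True,
        "version": torch.__version__,
        "gpu_available": torch.cuda.is_available(),   # True for CUDA *and* ROCm builds
        "rocm_build": torch.version.hip is not None,  # torch.version.hip is set only on ROCm wheels
    }

if __name__ == "__main__":
    print(torch_backend_summary())
```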
### Optional Dependencies

@@ -83,7 +91,7 @@ For 4-bit and 8-bit quantization support (reduces VRAM requirements):
```bash
# CUDA
pip install "bitsandbytes>=0.41.0"

# ROCm support may require building from source
# See: https://github.com/TimDettmers/bitsandbytes
```
@@ -272,7 +280,7 @@ curl -X POST http://localhost:8000/v1/chat/completions \
```bash
# Install CUDA-enabled PyTorch
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/cu121

# Run with GPU acceleration (automatic)
python coderai --model meta-llama/Llama-2-7b-chat-hf
```
@@ -284,8 +292,8 @@ python coderai --model meta-llama/Llama-2-7b-chat-hf --flash-attn

### ROCm (AMD GPU)
```bash
# Install ROCm-enabled PyTorch (use 6.0 for newer GPUs, 5.6 for older)
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/rocm6.0

# Run with GPU acceleration (automatic)
python coderai --model meta-llama/Llama-2-7b-chat-hf
```
@@ -297,7 +305,7 @@ python coderai --model meta-llama/Llama-2-7b-chat-hf
```bash
# Install CPU-only PyTorch
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0" --index-url https://download.pytorch.org/whl/cpu

# Run on CPU (automatic fallback)
python coderai --model microsoft/DialoGPT-medium
```
@@ -352,6 +360,17 @@ python coderai --model meta-llama/Llama-2-70b-chat-hf --load-in-8bit

## Troubleshooting
### Shell Redirection Error: "No such file or directory: '0.0'"
**Problem**: Running `pip install torch>=2.0.0` fails with an error about file "0.0" or "=2.0.0" not found.
**Cause**: The shell interprets `>` as output redirection. The command creates a file named "=2.0.0" and installs an unversioned torch package.
**Solutions**:
1. **Use quotes** (recommended): `pip install "torch>=2.0.0"`
2. **Use exact versions**: `pip install torch==2.0.0`
3. **Use requirements.txt**: Add exact versions to requirements.txt and run `pip install -r requirements.txt`
### Out of Memory Errors

**Problem**: `CUDA out of memory` or system RAM exhausted

@@ -373,6 +392,33 @@ python coderai --model meta-llama/Llama-2-70b-chat-hf --load-in-8bit
4. Check GPU compatibility (Ampere, Ada Lovelace, Hopper for NVIDIA)
5. Skip Flash Attention - the server works without it
### Flash Attention: No module named 'torch' during build
**Problem**: Flash Attention build fails with `ModuleNotFoundError: No module named 'torch'` even though PyTorch is installed (e.g., PyTorch 2.9.1+rocm6.4).
**Cause**: pip uses isolated build environments by default, which prevents flash-attention from seeing the installed torch package during compilation.
**Solutions**:
1. **Use --no-build-isolation flag** (recommended):
```bash
pip install flash-attn --no-build-isolation
```
2. **For ROCm systems**, you may also need to limit parallel jobs to avoid resource exhaustion:
```bash
MAX_JOBS=4 pip install flash-attn --no-build-isolation
```
3. **Use pre-built wheels** if available for your platform (check https://github.com/Dao-AILab/flash-attention/releases)
4. **ROCm 6.4 compatibility note**: Flash Attention may not officially support ROCm 6.4 yet (it was primarily built for ROCm 6.0). If build fails on ROCm 6.4, you can run without Flash Attention:
```bash
python coderai --model meta-llama/Llama-2-7b-chat-hf
# (omit the --flash-attn flag)
```
5. **Fallback**: The server works perfectly without Flash Attention - simply omit the `--flash-attn` flag when starting the server.
### bitsandbytes Not Working on ROCm

**Problem**: Quantization fails on AMD GPUs
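Before reaching for quantization flags like `--load-in-8bit`, a generic preflight check can distinguish a missing bitsandbytes install from a missing GPU. This is a hedged sketch (the helper name is hypothetical, not project API, and it does not cover ROCm-specific bitsandbytes limitations):

```python
def quantization_preflight():
    """Return a list of reasons bitsandbytes quantization may fail (empty = looks OK)."""
    problems = []
    try:
        import bitsandbytes  # noqa: F401
    except ImportError:
        problems.append('bitsandbytes is not installed (pip install "bitsandbytes>=0.41.0")')
    try:
        import torch
        if not torch.cuda.is_available():
            problems.append("no GPU visible to torch (bitsandbytes needs one)")
    except ImportError:
        problems.append("torch is not installed")
    return problems

for p in quantization_preflight():
    print("warning:", p)
```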
@@ -3,20 +3,27 @@ fastapi>=0.104.0
uvicorn[standard]>=0.24.0
pydantic>=2.5.0

# PyTorch - Uncomment the appropriate version for your system.
# IMPORTANT: When typing pip install commands in a shell, put quotes around
# version specifiers: an unquoted >= is interpreted as output redirection.
# (Inside this file, >= is safe, since pip reads it directly with no shell involved.)
#
# Option 1: Use exact versions (recommended for requirements.txt)
# Option 2: Use quotes on the command line: pip install "torch>=2.0.0"

# For NVIDIA (CUDA):
# torch==2.0.0
# torchvision==0.15.0
# torchaudio==2.0.0

# For AMD (ROCm) - see available versions at https://pytorch.org/get-started/locally/
# rocm6.0 is recommended for newer AMD GPUs, rocm5.6 for older ones
# --index-url https://download.pytorch.org/whl/rocm6.0
# torch==2.0.0
# torchvision==0.15.0
# torchaudio==2.0.0

# For CPU only:
torch==2.0.0

# ML dependencies
transformers>=4.35.0
@@ -37,6 +44,16 @@ procname>=0.3.0
# flash-attn>=2.5.0

# Installation instructions:
# IMPORTANT: Always use quotes or exact versions to avoid shell redirection issues!
#
# 1. For NVIDIA GPUs (CUDA 12.1):
#    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
#
# 2. For AMD GPUs (ROCm 6.0 recommended):
#    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
#
# 3. For CPU only:
#    pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
#
# If you see "No such file or directory: '0.0'" errors, you forgot to use quotes!
# The shell interprets >= as redirection. Fix: pip install "torch>=2.0.0" (with quotes)