Add --vulkan-single-gpu flag to force Vulkan to use only one GPU

When multiple Vulkan-compatible GPUs are present (e.g., NVIDIA + AMD),
llama.cpp automatically distributes layers across all GPUs for performance.
This can cause unwanted VRAM allocation on the NVIDIA GPU when the user
wants to use only the AMD GPU.

The new --vulkan-single-gpu flag builds a tensor_split that assigns all
model layers to the device selected with --vulkan-device, so no layers are
distributed to other GPUs.

- Added --vulkan-single-gpu argument
- Added count_vulkan_devices() method to detect GPU count
- Modified load_model to build tensor_split array when single_gpu=True
- Updated README with documentation for the new flag

Example usage:
  python coderai --model model.gguf --backend vulkan --vulkan-device 1 --vulkan-single-gpu
parent e0f0e99d
@@ -184,6 +184,7 @@ options:
 default: -1 = all layers)
 --n-ctx N Context window size (Vulkan only, default: 2048)
 --vulkan-device N Vulkan GPU device ID to use (Vulkan only, default: 0)
+--vulkan-single-gpu Force Vulkan to use only the specified GPU (prevents layer distribution across multiple GPUs)
 --vulkan-list-devices List available Vulkan GPU devices and exit
 ```
@@ -441,9 +442,18 @@ python coderai --model bigscience/bloom-7b1 --offload-dir /path/to/fast/storage
 ### Using Vulkan with Multiple GPUs (NVIDIA + AMD)
-If your system has both NVIDIA and AMD GPUs, Vulkan may allocate some resources on all visible GPUs. To force Vulkan to use **only** the AMD GPU and prevent VRAM allocation on the NVIDIA GPU:
+If your system has both NVIDIA and AMD GPUs, llama.cpp's Vulkan backend will automatically distribute layers across all visible GPUs for performance. To force Vulkan to use **only** the AMD GPU and prevent VRAM allocation on the NVIDIA GPU:
-**Method 1: Use environment variable to select specific Vulkan device**
+**Method 1: Use `--vulkan-single-gpu` flag (Recommended)**
+```bash
+# Force all layers onto the specified GPU device only
+# For example, to use only device 1 (AMD GPU):
+python coderai --model model.gguf --backend vulkan --vulkan-device 1 --vulkan-single-gpu --port 6744
+# This creates a tensor_split that puts 0% on other GPUs and 100% on the selected GPU
+```
+**Method 2: Use environment variable to select specific Vulkan device**
 ```bash
 # List available Vulkan devices first
 python coderai --vulkan-list-devices
@@ -453,30 +463,19 @@ python coderai --vulkan-list-devices
 VK_DEVICE_SELECT_DEVICE=1 python coderai --model model.gguf --backend vulkan --vulkan-device 0 --port 6744
 ```
-**Method 2: Hide NVIDIA GPU from CUDA (prevents any CUDA usage)**
+**Method 3: Hide NVIDIA GPU from CUDA (prevents any CUDA usage)**
 ```bash
 # Make NVIDIA GPU invisible to CUDA/Vulkan
 CUDA_VISIBLE_DEVICES="" python coderai --model model.gguf --backend vulkan --vulkan-device 0 --port 6744
 ```
-**Method 3: Use llama-cpp-python's device filtering (in code)**
-```python
-# In your own scripts using llama-cpp-python directly:
-from llama_cpp import Llama
-# main_gpu parameter selects which Vulkan device to use
-llm = Llama(
-    model_path="./model.gguf",
-    n_gpu_layers=-1,
-    n_ctx=2048,
-    main_gpu=0, # Use first Vulkan device (should be AMD if NVIDIA is hidden)
-)
-```
+**Understanding the Issue:**
+When you have multiple Vulkan-compatible GPUs, llama.cpp automatically distributes model layers across them (shown in logs as "layer X assigned to device VulkanY"). The `--vulkan-single-gpu` flag prevents this by using the `tensor_split` parameter with a value of `[0.0, 1.0]` (or similar depending on device count), which tells llama.cpp to put 0% of layers on some GPUs and 100% on the selected GPU.
 **Notes:**
 - The `--vulkan-device` argument maps to `main_gpu` in llama-cpp-python
+- The `--vulkan-single-gpu` flag builds a `tensor_split` array to force single GPU usage
 - Vulkan enumerates all GPUs in your system, so device IDs may differ from CUDA device IDs
 - If you see VRAM allocated on both GPUs, use `VK_DEVICE_SELECT_DEVICE` or hide NVIDIA from CUDA
 - The `vulkaninfo` command shows all GPUs visible to Vulkan
 ### Multi-GPU Setup
......
@@ -623,6 +623,25 @@ class VulkanBackend(ModelBackend):
         except Exception:
             pass
+    def count_vulkan_devices(self):
+        """Count the number of Vulkan GPU devices available."""
+        try:
+            from llama_cpp import llama_get_devices
+            devices = llama_get_devices()
+            return len(devices)
+        except:
+            # Fallback: try to parse vulkaninfo
+            try:
+                import subprocess
+                result = subprocess.run(['vulkaninfo', '--summary'], capture_output=True, text=True)
+                if result.returncode == 0:
+                    # Count GPU devices in output
+                    gpu_count = result.stdout.count('GPU') + result.stdout.count('device')
+                    return max(gpu_count, 1)
+            except:
+                pass
+        return 2 # Default to 2 if we can't detect
     def load_model(self, model_name: str, **kwargs) -> None:
         """Load a GGUF model using llama-cpp-python."""
         from llama_cpp import Llama
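Review note on the `vulkaninfo` fallback above: counting every occurrence of `'GPU'` and `'device'` will likely overcount, because each device entry in the summary carries several fields containing those substrings (for example `deviceName` and `deviceType`). If the `llama_get_devices` import is unavailable in the installed llama-cpp-python, this fallback becomes the main path, so an inflated count would feed directly into `tensor_split`. A minimal sketch of a stricter count follows. It assumes that `vulkaninfo --summary` prints one `GPU<n>:` header line per physical device, which should be checked against local output; the function names `count_devices_in_summary` and `count_vulkan_devices_via_vulkaninfo` are hypothetical and not part of this commit.

```python
import re
import subprocess

# Assumption: `vulkaninfo --summary` prints one "GPU<n>:" header line per
# physical device. Verify this against the output on your own system.
_GPU_HEADER = re.compile(r"^[ \t]*GPU(\d+)[ \t]*:", re.MULTILINE)


def count_devices_in_summary(summary_text: str) -> int:
    """Return the number of distinct "GPU<n>:" headers in the summary text."""
    return len({int(index) for index in _GPU_HEADER.findall(summary_text)})


def count_vulkan_devices_via_vulkaninfo(timeout_s: float = 10.0):
    """Return the device count, or None when it cannot be determined."""
    try:
        result = subprocess.run(
            ["vulkaninfo", "--summary"],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except (OSError, subprocess.SubprocessError):
        # vulkaninfo is missing, not executable, or timed out.
        return None
    if result.returncode != 0:
        return None
    count = count_devices_in_summary(result.stdout)
    return count if count > 0 else None
```

Returning `None` instead of a hard-coded default of 2 lets the caller decide whether to skip `tensor_split` entirely when the device count is unknown.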
@@ -686,14 +705,39 @@ class VulkanBackend(ModelBackend):
         # List available devices for user reference
         self.list_vulkan_devices()
+        # Check if single GPU mode is requested
+        single_gpu = kwargs.get('single_gpu', False)
+        tensor_split = None
+        if single_gpu:
+            # Build tensor_split to force all layers onto one GPU
+            # We need to detect how many GPUs are visible to Vulkan
+            num_devices = self.count_vulkan_devices()
+            # Create tensor_split array: 1.0 for selected GPU, 0.0 for others
+            tensor_split = [0.0] * num_devices
+            if main_gpu < len(tensor_split):
+                tensor_split[main_gpu] = 1.0
+            else:
+                print(f"Warning: main_gpu={main_gpu} exceeds detected devices ({num_devices})")
+                tensor_split = None
+            if tensor_split:
+                print(f" Single GPU mode: Forcing all layers to GPU {main_gpu}")
+                print(f" Tensor split: {tensor_split}")
         try:
-            self.model = Llama(
-                model_path=model_path,
-                n_gpu_layers=n_gpu_layers,
-                n_ctx=n_ctx,
-                verbose=verbose,
-                main_gpu=main_gpu,
-            )
+            llama_kwargs = {
+                'model_path': model_path,
+                'n_gpu_layers': n_gpu_layers,
+                'n_ctx': n_ctx,
+                'verbose': verbose,
+                'main_gpu': main_gpu,
+            }
+            if tensor_split:
+                llama_kwargs['tensor_split'] = tensor_split
+            self.model = Llama(**llama_kwargs)
             self.model_name = model_name
             print("\nModel loaded successfully with Vulkan!")
         except Exception as e:
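Review note on the single-GPU branch above: the split construction could be pulled into a pure helper so it can be tested without loading a model. The sketch below is one possible shape, not the code in this commit; the name `build_single_gpu_split` is hypothetical. It also rejects a negative `main_gpu`, which the `main_gpu < len(tensor_split)` check in the hunk would accept and then index from the end of the list.

```python
from typing import List


def build_single_gpu_split(num_devices: int, main_gpu: int) -> List[float]:
    """Build a tensor_split list with 1.0 for main_gpu and 0.0 elsewhere.

    Hypothetical helper: raises ValueError instead of printing a warning,
    so the caller chooses how to fall back.
    """
    if num_devices < 1:
        raise ValueError(f"num_devices must be >= 1, got {num_devices}")
    if not 0 <= main_gpu < num_devices:
        raise ValueError(
            f"main_gpu={main_gpu} is outside the detected range 0..{num_devices - 1}"
        )
    split = [0.0] * num_devices
    split[main_gpu] = 1.0
    return split
```

For two detected devices and `--vulkan-device 1`, `build_single_gpu_split(2, 1)` returns `[0.0, 1.0]`, matching the value described in the README notes.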
@@ -1301,6 +1345,11 @@ def parse_args():
         default=0,
         help="Vulkan GPU device ID to use (Vulkan backend only, default: 0). Use --vulkan-list-devices to see available devices",
     )
+    parser.add_argument(
+        "--vulkan-single-gpu",
+        action="store_true",
+        help="Force Vulkan to use only the specified GPU device (prevents layer distribution across multiple GPUs)",
+    )
     parser.add_argument(
         "--vulkan-list-devices",
         action="store_true",
@@ -1371,6 +1420,7 @@ def main():
         'n_gpu_layers': args.n_gpu_layers,
         'n_ctx': args.n_ctx,
         'main_gpu': args.vulkan_device,
+        'single_gpu': args.vulkan_single_gpu,
     }
     try:
......