Add --vulkan-single-gpu flag to force Vulkan to use only one GPU

When multiple Vulkan-compatible GPUs are present (e.g., NVIDIA + AMD),
llama.cpp automatically distributes layers across all GPUs for performance.
This can cause unwanted VRAM allocation on the NVIDIA GPU when the user
wants to use only the AMD GPU.

The new --vulkan-single-gpu flag builds a tensor_split array (1.0 for the
device selected with --vulkan-device, 0.0 for every other device) so that
all model layers are placed on that one GPU instead of being distributed.

- Added --vulkan-single-gpu argument
- Added count_vulkan_devices() method to detect GPU count
- Modified load_model to build tensor_split array when single_gpu=True
- Updated README with documentation for the new flag

Example usage:
  python coderai --model model.gguf --backend vulkan --vulkan-device 1 --vulkan-single-gpu
parent e0f0e99d
@@ -184,6 +184,7 @@ options:
                         default: -1 = all layers)
   --n-ctx N             Context window size (Vulkan only, default: 2048)
   --vulkan-device N     Vulkan GPU device ID to use (Vulkan only, default: 0)
+  --vulkan-single-gpu   Force Vulkan to use only the specified GPU (prevents layer distribution across multiple GPUs)
   --vulkan-list-devices List available Vulkan GPU devices and exit
 ```
@@ -441,9 +442,18 @@ python coderai --model bigscience/bloom-7b1 --offload-dir /path/to/fast/storage
 ### Using Vulkan with Multiple GPUs (NVIDIA + AMD)
-If your system has both NVIDIA and AMD GPUs, Vulkan may allocate some resources on all visible GPUs. To force Vulkan to use **only** the AMD GPU and prevent VRAM allocation on the NVIDIA GPU:
+If your system has both NVIDIA and AMD GPUs, llama.cpp's Vulkan backend will automatically distribute layers across all visible GPUs for performance. To force Vulkan to use **only** the AMD GPU and prevent VRAM allocation on the NVIDIA GPU:
-**Method 1: Use environment variable to select specific Vulkan device**
+**Method 1: Use `--vulkan-single-gpu` flag (Recommended)**
+```bash
+# Force all layers onto the specified GPU device only
+# For example, to use only device 1 (AMD GPU):
+python coderai --model model.gguf --backend vulkan --vulkan-device 1 --vulkan-single-gpu --port 6744
+# This creates a tensor_split that puts 0% on other GPUs and 100% on the selected GPU
+```
+**Method 2: Use environment variable to select specific Vulkan device**
 ```bash
 # List available Vulkan devices first
 python coderai --vulkan-list-devices
@@ -453,30 +463,19 @@ python coderai --vulkan-list-devices
 VK_DEVICE_SELECT_DEVICE=1 python coderai --model model.gguf --backend vulkan --vulkan-device 0 --port 6744
 ```
-**Method 2: Hide NVIDIA GPU from CUDA (prevents any CUDA usage)**
+**Method 3: Hide NVIDIA GPU from CUDA (prevents any CUDA usage)**
 ```bash
 # Make NVIDIA GPU invisible to CUDA/Vulkan
 CUDA_VISIBLE_DEVICES="" python coderai --model model.gguf --backend vulkan --vulkan-device 0 --port 6744
 ```
-**Method 3: Use llama-cpp-python's device filtering (in code)**
-```python
-# In your own scripts using llama-cpp-python directly:
-from llama_cpp import Llama
-# main_gpu parameter selects which Vulkan device to use
-llm = Llama(
-    model_path="./model.gguf",
-    n_gpu_layers=-1,
-    n_ctx=2048,
-    main_gpu=0,  # Use first Vulkan device (should be AMD if NVIDIA is hidden)
-)
-```
+**Understanding the Issue:**
+When you have multiple Vulkan-compatible GPUs, llama.cpp automatically distributes model layers across them (shown in logs as "layer X assigned to device VulkanY"). The `--vulkan-single-gpu` flag prevents this by using the `tensor_split` parameter with a value of `[0.0, 1.0]` (or similar depending on device count), which tells llama.cpp to put 0% of layers on some GPUs and 100% on the selected GPU.
 **Notes:**
 - The `--vulkan-device` argument maps to `main_gpu` in llama-cpp-python
+- The `--vulkan-single-gpu` flag builds a `tensor_split` array to force single GPU usage
 - Vulkan enumerates all GPUs in your system, so device IDs may differ from CUDA device IDs
-- If you see VRAM allocated on both GPUs, use `VK_DEVICE_SELECT_DEVICE` or hide NVIDIA from CUDA
 - The `vulkaninfo` command shows all GPUs visible to Vulkan
 ### Multi-GPU Setup
......
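Review note: the `tensor_split` construction described in the README text can be sketched as a small pure function. This is only an illustration under stated assumptions. The name `build_single_gpu_split` is hypothetical and not part of this commit, and rejecting negative indices is an added choice rather than something the commit specifies.

```python
from typing import List, Optional


def build_single_gpu_split(main_gpu: int, num_devices: int) -> Optional[List[float]]:
    """Return a split with 1.0 at main_gpu and 0.0 elsewhere, or None if out of range."""
    # Hypothetical helper: refuse indices that do not name a detected device,
    # including negative ones, which Python would otherwise index from the end.
    if num_devices < 1 or not 0 <= main_gpu < num_devices:
        return None
    split = [0.0] * num_devices
    split[main_gpu] = 1.0
    return split


# Two visible devices, keep everything on device 1
print(build_single_gpu_split(1, 2))
# Device index beyond the detected count
print(build_single_gpu_split(2, 2))
```

Returning `None` for an invalid index lets the caller skip passing `tensor_split` at all, which matches how the commit falls back to default behaviour after printing a warning.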
@@ -622,7 +622,26 @@ class VulkanBackend(ModelBackend):
                 print(result.stdout)
         except Exception:
             pass
+    def count_vulkan_devices(self):
+        """Count the number of Vulkan GPU devices available."""
+        try:
+            # If this helper is unavailable, the import fails and we fall back below
+            from llama_cpp import llama_get_devices
+            devices = llama_get_devices()
+            return len(devices)
+        except Exception:
+            # Fallback: try to parse vulkaninfo
+            try:
+                import re
+                import subprocess
+                result = subprocess.run(['vulkaninfo', '--summary'], capture_output=True, text=True)
+                if result.returncode == 0:
+                    # Count per-device header lines (assumed to look like "GPU0:") rather than
+                    # every occurrence of "GPU" or "device", which would overcount
+                    gpu_count = len(re.findall(r'^\s*GPU\d+:', result.stdout, re.MULTILINE))
+                    return max(gpu_count, 1)
+            except Exception:
+                pass
+        return 2  # Default to 2 if we can't detect
     def load_model(self, model_name: str, **kwargs) -> None:
         """Load a GGUF model using llama-cpp-python."""
         from llama_cpp import Llama
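Review note: when falling back to `vulkaninfo --summary`, one option is to match the per-device header lines instead of counting substrings, because words such as `GPU` also appear in the device type and device name fields. The sketch below works on a hand-written sample. The sample's layout (`GPU0:` headers followed by indented fields) is an assumption about typical `vulkaninfo --summary` output, and both function names are hypothetical.

```python
import re

# Hand-written sample; the exact vulkaninfo layout is assumed, not captured from a real run.
SAMPLE_SUMMARY = """\
Devices:
========
GPU0:
    deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName = Example NVIDIA GPU
GPU1:
    deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName = Example AMD GPU
"""


def count_devices_in_summary(summary: str) -> int:
    """Count lines that look like a device header, e.g. 'GPU0:'."""
    return len(re.findall(r"^\s*GPU\d+:", summary, flags=re.MULTILINE))


def count_by_substring(summary: str) -> int:
    """Substring counting, shown only for comparison."""
    return summary.count("GPU") + summary.count("device")


print(count_devices_in_summary(SAMPLE_SUMMARY))
print(count_by_substring(SAMPLE_SUMMARY))
```

On this sample the header-based count matches the two listed devices, while the substring count is much larger, which would produce a `tensor_split` with too many entries.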
@@ -686,14 +705,39 @@ class VulkanBackend(ModelBackend):
         # List available devices for user reference
         self.list_vulkan_devices()
+        # Check if single GPU mode is requested
+        single_gpu = kwargs.get('single_gpu', False)
+        tensor_split = None
+        if single_gpu:
+            # Build tensor_split to force all layers onto one GPU
+            # We need to detect how many GPUs are visible to Vulkan
+            num_devices = self.count_vulkan_devices()
+            # Create tensor_split array: 1.0 for selected GPU, 0.0 for others
+            tensor_split = [0.0] * num_devices
+            # Reject negative indices too; they would otherwise index from the end
+            if 0 <= main_gpu < len(tensor_split):
+                tensor_split[main_gpu] = 1.0
+            else:
+                print(f"Warning: main_gpu={main_gpu} is out of range for detected devices ({num_devices})")
+                tensor_split = None
+        if tensor_split:
+            print(f"  Single GPU mode: Forcing all layers to GPU {main_gpu}")
+            print(f"  Tensor split: {tensor_split}")
         try:
-            self.model = Llama(
-                model_path=model_path,
-                n_gpu_layers=n_gpu_layers,
-                n_ctx=n_ctx,
-                verbose=verbose,
-                main_gpu=main_gpu,
-            )
+            llama_kwargs = {
+                'model_path': model_path,
+                'n_gpu_layers': n_gpu_layers,
+                'n_ctx': n_ctx,
+                'verbose': verbose,
+                'main_gpu': main_gpu,
+            }
+            # Only pass tensor_split when single GPU mode produced a valid split
+            if tensor_split:
+                llama_kwargs['tensor_split'] = tensor_split
+            self.model = Llama(**llama_kwargs)
             self.model_name = model_name
             print("\nModel loaded successfully with Vulkan!")
         except Exception as e:
@@ -1301,6 +1345,11 @@ def parse_args():
         default=0,
         help="Vulkan GPU device ID to use (Vulkan backend only, default: 0). Use --vulkan-list-devices to see available devices",
     )
+    parser.add_argument(
+        "--vulkan-single-gpu",
+        action="store_true",
+        help="Force Vulkan to use only the specified GPU device (prevents layer distribution across multiple GPUs)",
+    )
     parser.add_argument(
         "--vulkan-list-devices",
         action="store_true",
@@ -1371,6 +1420,7 @@ def main():
             'n_gpu_layers': args.n_gpu_layers,
             'n_ctx': args.n_ctx,
             'main_gpu': args.vulkan_device,
+            'single_gpu': args.vulkan_single_gpu,
         }
     try:
......
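Review note: the path from the command-line flag to the backend keyword argument can be checked in isolation. The sketch below reproduces only the two Vulkan arguments relevant here. `build_parser` and `to_backend_kwargs` are hypothetical names used for illustration, and the real `parse_args()` and `main()` in this project contain many more options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical, reduced parser with only the two flags discussed in this commit."""
    parser = argparse.ArgumentParser(prog="coderai")
    parser.add_argument("--vulkan-device", type=int, default=0)
    # store_true means the attribute is False unless the flag is given
    parser.add_argument("--vulkan-single-gpu", action="store_true")
    return parser


def to_backend_kwargs(args: argparse.Namespace) -> dict:
    """Map parsed arguments to the keyword names the backend reads."""
    # argparse converts dashes in option names to underscores in attribute names
    return {
        "main_gpu": args.vulkan_device,
        "single_gpu": args.vulkan_single_gpu,
    }


args = build_parser().parse_args(["--vulkan-device", "1", "--vulkan-single-gpu"])
print(to_backend_kwargs(args))
```

Because the flag uses `store_true`, omitting it leaves `single_gpu` as `False`, so existing invocations keep the default multi-GPU layer distribution.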