Add --vulkan-single-gpu flag to force Vulkan to use only one GPU

When multiple Vulkan-compatible GPUs are present (e.g., NVIDIA + AMD), llama.cpp automatically distributes layers across all GPUs for performance. This can cause unwanted VRAM allocation on the NVIDIA GPU when the user wants to use only the AMD GPU.

The new --vulkan-single-gpu flag uses tensor_split to force all model layers onto a single specified GPU device, preventing distribution.

- Added --vulkan-single-gpu argument
- Added count_vulkan_devices() method to detect GPU count
- Modified load_model to build tensor_split array when single_gpu=True
- Updated README with documentation for the new flag

Example usage:
python coderai --model model.gguf --backend vulkan --vulkan-device 1 --vulkan-single-gpu

Commit a62cb69d authored Feb 28, 2026 by Stefy Lanza (nextime / spora )

Showing with 74 additions and 25 deletions

README.md +16 -17

coderai +58 -8

--- a/README.md
+++ b/README.md
@@ -184,6 +184,7 @@ options:
                        default: -1 = all layers)
  --n-ctx N             Context window size (Vulkan only, default: 2048)
  --vulkan-device N     Vulkan GPU device ID to use (Vulkan only, default: 0)
+  --vulkan-single-gpu   Force Vulkan to use only the specified GPU (prevents layer distribution across multiple GPUs)
  --vulkan-list-devices List available Vulkan GPU devices and exit
 ```

@@ -441,9 +442,18 @@ python coderai --model bigscience/bloom-7b1 --offload-dir /path/to/fast/storage

 ### Using Vulkan with Multiple GPUs (NVIDIA + AMD)

-If your system has both NVIDIA and AMD GPUs, Vulkan may allocate some resources on all visible GPUs. To force Vulkan to use **only** the AMD GPU and prevent VRAM allocation on the NVIDIA GPU:
+If your system has both NVIDIA and AMD GPUs, llama.cpp's Vulkan backend will automatically distribute layers across all visible GPUs for performance. To force Vulkan to use **only** the AMD GPU and prevent VRAM allocation on the NVIDIA GPU:

-**Method 1: Use environment variable to select specific Vulkan device**
+**Method 1: Use `--vulkan-single-gpu` flag (Recommended)**
+```bash
+# Force all layers onto the specified GPU device only
+# For example, to use only device 1 (AMD GPU):
+python coderai --model model.gguf --backend vulkan --vulkan-device 1 --vulkan-single-gpu --port 6744
+
+# This creates a tensor_split that puts 0% on other GPUs and 100% on the selected GPU
+```
+
+**Method 2: Use environment variable to select specific Vulkan device**
 ```bash
 # List available Vulkan devices first
 python coderai --vulkan-list-devices
@@ -453,30 +463,19 @@ python coderai --vulkan-list-devices
 VK_DEVICE_SELECT_DEVICE=1 python coderai --model model.gguf --backend vulkan --vulkan-device 0 --port 6744
 ```

-**Method 2: Hide NVIDIA GPU from CUDA (prevents any CUDA usage)**
+**Method 3: Hide NVIDIA GPU from CUDA (prevents any CUDA usage)**
 ```bash
 # Make the NVIDIA GPU invisible to CUDA (this variable does not control Vulkan device enumeration)
 CUDA_VISIBLE_DEVICES="" python coderai --model model.gguf --backend vulkan --vulkan-device 0 --port 6744
 ```

-**Method 3: Use llama-cpp-python's device filtering (in code)**
-```python
-# In your own scripts using llama-cpp-python directly:
-from llama_cpp import Llama
-
-# main_gpu parameter selects which Vulkan device to use
-llm = Llama(
-    model_path="./model.gguf",
-    n_gpu_layers=-1,
-    n_ctx=2048,
-    main_gpu=0,  # Use first Vulkan device (should be AMD if NVIDIA is hidden)
-)
-```
+**Understanding the Issue:**
+When you have multiple Vulkan-compatible GPUs, llama.cpp automatically distributes model layers across them (shown in logs as "layer X assigned to device VulkanY"). The `--vulkan-single-gpu` flag prevents this by passing a `tensor_split` list with one entry per detected device, for example `[0.0, 1.0]` when device 1 is selected on a two-GPU system, which tells llama.cpp to place 0% of the layers on every other GPU and 100% on the selected GPU.

 **Notes:**
 - The `--vulkan-device` argument maps to `main_gpu` in llama-cpp-python
+- The `--vulkan-single-gpu` flag builds a `tensor_split` array to force single GPU usage
 - Vulkan enumerates all GPUs in your system, so device IDs may differ from CUDA device IDs
- If you see VRAM allocated on both GPUs, use `VK_DEVICE_SELECT_DEVICE` or hide NVIDIA from CUDA
 - The `vulkaninfo` command shows all GPUs visible to Vulkan

 ### Multi-GPU Setup

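The `tensor_split` construction described in the README notes can be sketched as a small standalone function. This is a minimal illustration, not the code from this commit: the name `build_single_gpu_split` is hypothetical, and unlike the check in `load_model` it also rejects a negative device index.

```python
from typing import List, Optional


def build_single_gpu_split(main_gpu: int, num_devices: int) -> Optional[List[float]]:
    """Return a one-hot tensor_split list, or None if main_gpu is out of range.

    Hypothetical helper: the selected device gets weight 1.0 and every
    other device gets 0.0, so all layers are assigned to one GPU.
    """
    if num_devices < 1 or not 0 <= main_gpu < num_devices:
        return None
    split = [0.0] * num_devices
    split[main_gpu] = 1.0
    return split


if __name__ == "__main__":
    print(build_single_gpu_split(1, 2))  # [0.0, 1.0]
    print(build_single_gpu_split(0, 3))  # [1.0, 0.0, 0.0]
    print(build_single_gpu_split(2, 2))  # None (index beyond detected devices)
```

Returning `None` for an invalid index mirrors the fallback in `load_model`, where `tensor_split` is simply not passed to `Llama` and the default layer distribution applies.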
--- a/coderai
+++ b/coderai
@@ -623,6 +623,25 @@ class VulkanBackend(ModelBackend):
        except Exception:
            pass
    
+    def count_vulkan_devices(self):
+        """Count the number of Vulkan GPU devices available."""
+        try:
+            from llama_cpp import llama_get_devices
+            devices = llama_get_devices()
+            return len(devices)
+        except Exception:
+            # Fallback: try to parse vulkaninfo
+            try:
+                import subprocess
+                result = subprocess.run(['vulkaninfo', '--summary'], capture_output=True, text=True)
+                if result.returncode == 0:
+                    # Count device header lines such as "GPU0:" rather than every substring match
+                    gpu_count = sum(1 for line in result.stdout.splitlines() if line.strip().startswith('GPU'))
+                    return max(gpu_count, 1)
+            except Exception:
+                pass
+        return 2  # Default to 2 if we can't detect
+    
    def load_model(self, model_name: str, **kwargs) -> None:
        """Load a GGUF model using llama-cpp-python."""
        from llama_cpp import Llama
@@ -686,14 +705,39 @@ class VulkanBackend(ModelBackend):
        # List available devices for user reference
        self.list_vulkan_devices()
        
+        # Check if single GPU mode is requested
+        single_gpu = kwargs.get('single_gpu', False)
+        tensor_split = None
+        
+        if single_gpu:
+            # Build tensor_split to force all layers onto one GPU
+            # We need to detect how many GPUs are visible to Vulkan
+            num_devices = self.count_vulkan_devices()
+            # Create tensor_split array: 1.0 for selected GPU, 0.0 for others
+            tensor_split = [0.0] * num_devices
+            if main_gpu < len(tensor_split):
+                tensor_split[main_gpu] = 1.0
+            else:
+                print(f"Warning: main_gpu={main_gpu} exceeds detected devices ({num_devices})")
+                tensor_split = None
+            
+            if tensor_split:
+                print(f"  Single GPU mode: Forcing all layers to GPU {main_gpu}")
+                print(f"  Tensor split: {tensor_split}")
+        
        try:
-            self.model = Llama(
-                model_path=model_path,
-                n_gpu_layers=n_gpu_layers,
-                n_ctx=n_ctx,
-                verbose=verbose,
-                main_gpu=main_gpu,
-            )
+            llama_kwargs = {
+                'model_path': model_path,
+                'n_gpu_layers': n_gpu_layers,
+                'n_ctx': n_ctx,
+                'verbose': verbose,
+                'main_gpu': main_gpu,
+            }
+            
+            if tensor_split:
+                llama_kwargs['tensor_split'] = tensor_split
+            
+            self.model = Llama(**llama_kwargs)
            self.model_name = model_name
            print("\nModel loaded successfully with Vulkan!")
        except Exception as e:
@@ -1301,6 +1345,11 @@ def parse_args():
        default=0,
        help="Vulkan GPU device ID to use (Vulkan backend only, default: 0). Use --vulkan-list-devices to see available devices",
    )
+    parser.add_argument(
+        "--vulkan-single-gpu",
+        action="store_true",
+        help="Force Vulkan to use only the specified GPU device (prevents layer distribution across multiple GPUs)",
+    )
    parser.add_argument(
        "--vulkan-list-devices",
        action="store_true",
@@ -1371,6 +1420,7 @@ def main():
        'n_gpu_layers': args.n_gpu_layers,
        'n_ctx': args.n_ctx,
        'main_gpu': args.vulkan_device,
+        'single_gpu': args.vulkan_single_gpu,
    }
    
    try:
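
The `vulkaninfo` fallback in `count_vulkan_devices()` depends on how the summary text is parsed. A plain substring count of `GPU` would also match fields such as `deviceType`, so a stricter approach is to match only the per-device header lines. The sketch below assumes the summary lists each device under a header of the form `GPU0:`, `GPU1:`, and so on. The function name `count_gpus_in_summary` and the sample text are made up for illustration, and real output may differ between `vulkaninfo` versions.

```python
import re

# Matches device header lines such as "GPU0:" (assumed summary layout).
_GPU_HEADER = re.compile(r"^\s*GPU(\d+)\s*:", re.MULTILINE)


def count_gpus_in_summary(summary_text: str) -> int:
    """Count distinct GPU header lines in `vulkaninfo --summary` style text."""
    return len(set(_GPU_HEADER.findall(summary_text)))


# Illustrative sample only; device names are placeholders.
SAMPLE = """\
Devices:
========
GPU0:
    deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName = Example GPU A
GPU1:
    deviceType = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName = Example GPU B
"""

if __name__ == "__main__":
    print(count_gpus_in_summary(SAMPLE))  # 2
```

Because the result sizes the `tensor_split` list, an overcount would produce a list longer than the real device count, so a caller may prefer to treat a count of zero as "unknown" and skip single-GPU mode rather than guess.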