Commits · 1da6025d49146e3e92b028143d7df704a5046afb · SexHackMe / vidai

07 Oct, 2025 40 commits

Fix GPU VRAM detection for cluster clients · 1da6025d

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Update detect_gpu_backends to collect actual VRAM for each GPU device
- Store device info including VRAM in gpu_info sent to master
- Use real VRAM data in cluster nodes API instead of hardcoded values
- Ensure consistent VRAM reporting between master and clients

1da6025d

Remove Status column from cluster nodes table · 97a13987

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Use green background color to indicate connected nodes
- Remove status text column for cleaner interface
- Update colspan values for table messages

97a13987

Update cluster nodes API to read from database · eb1870d9

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Store cluster client info in database for persistence
- Update API to read connected clients from database
- Maintain compatibility with existing web interface

eb1870d9

Fix method call in cluster master register processes · 4b98cda4

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Use _get_client_by_websocket instead of non-existent _get_client_by_socket
- Fixes client connection error during process registration

4b98cda4

Remove /cluster path from websocket URI · 0fc8c705

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Client connects to wss://host:port instead of wss://host:port/cluster
- Fixes connection loop issue

0fc8c705

Add reconnection logic to cluster client · d87215b6

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Client now attempts to reconnect if connection is lost
- Prevents processes from being restarted on reconnection
- Maintains persistent cluster node operation

d87215b6
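
The reconnection behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the project's actual client: `ReconnectingClient`, `connect`, `start_workers`, and `retry_delay` are hypothetical names, and the connection is modelled as an awaitable session so the example runs without a websocket server. The point is that workers are started once and survive later reconnects.

```python
import asyncio


class ReconnectingClient:
    """Hypothetical sketch: reconnect on failure, start workers only once."""

    def __init__(self, connect, start_workers, retry_delay=0.01):
        self.connect = connect              # async callable returning a session coroutine function
        self.start_workers = start_workers  # plain callable that launches worker processes
        self.retry_delay = retry_delay
        self.workers_started = False

    async def run(self, max_attempts=10):
        for _ in range(max_attempts):
            try:
                session = await self.connect()
                if not self.workers_started:
                    # Workers are launched on the first successful connection only,
                    # so a reconnect does not restart them.
                    self.start_workers()
                    self.workers_started = True
                await session()  # returns when the session ends cleanly
                return True
            except ConnectionError:
                # Connection refused or dropped: wait, then try again.
                await asyncio.sleep(self.retry_delay)
        return False
```

A real client would likely add a growing delay between attempts and run until shut down rather than stopping after a fixed number of tries.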

Fix cluster client process registration dict comprehension · b0c7da40

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Correct the dict comprehension for registering processes with master
- Fix duplicate entries and incorrect model assignment
- Apply same fix to restart workers function

b0c7da40

Start local backend in cluster client mode · d3ac0046

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Workers need a local backend to connect to even in client mode
- Add backend startup and readiness check for cluster clients
- Ensure proper cleanup on exit

d3ac0046

Add cluster SSL certificates to .gitignore · 0bb73422

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Ignore cluster.crt and cluster.key generated certificates
- Remove committed certificates from repository

0bb73422

Fix websockets handler signature for newer websockets version · 8d471b19

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Remove 'path' parameter from _handle_client method
- Compatible with websockets 12+ which removed the path argument

8d471b19

Integrate secure websocket cluster master into main vidai.py · 81b440d2

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Modify ClusterMaster to accept host parameter
- Start cluster master in vidai.py when running as master
- Use --cluster-host and --cluster-port for websocket server binding
- Default to 0.0.0.0:5003 for cluster master

81b440d2

Fix variable name conflict in admin config · 772f6213

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Rename 'flash' variable to 'flash_enabled' to avoid shadowing the flash() function
- Resolve TypeError when saving admin configuration

772f6213
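
The TypeError mentioned here is the usual result of rebinding a function name to a plain value. The sketch below reproduces the pattern with a stand-in `flash()` helper; the helper, the form field name, and both `save_config_*` functions are hypothetical and only illustrate the shadowing, not the project's actual route.

```python
def flash(message):
    """Stand-in for a framework flash() helper (hypothetical, for illustration)."""
    return f"flashed: {message}"


def save_config_buggy(form):
    # Assigning to 'flash' makes it a local variable that hides the function.
    flash = form.get("flash") == "on"
    # The name now refers to a bool, so calling it raises TypeError.
    return flash("Configuration saved")


def save_config_fixed(form):
    # A distinct name keeps the flash() function reachable.
    flash_enabled = form.get("flash") == "on"
    return flash("Configuration saved"), flash_enabled
```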

Remove backend selection from admin config AI Settings · 2c7ae960

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Remove analysis_backend and training_backend fields from /admin/config
- These are now configured per worker in the cluster nodes interface
- Clean up unused imports and form processing

2c7ae960

Fix missing imports in admin config page · 02b5e4e9

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Add missing set_* function imports to admin.py config route
- Resolve NameError when saving admin configuration

02b5e4e9

Enlarge cluster nodes page container to 95% width · a55af666

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Change container max-width to 95% for better use of screen space
- Maintain centered layout for the cluster nodes table

a55af666

Fix cluster nodes modal to show per-worker driver selects · 044c2cf2

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Show individual select forms for every worker in the driver modal
- Update API to handle per-worker driver selection for local nodes
- Maintain compatibility with existing backend switching logic

044c2cf2

Add --config CLI argument and fix cluster nodes driver selection · a7d2d90e

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Add --config <file> argument to load config from custom path
- Modify config loader to use custom config file if specified
- Fix cluster nodes interface to only show available GPU backends for workers
- Differentiate between local and remote node driver selection

a7d2d90e
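
A `--config <file>` option of this kind could be wired up along the following lines. This is a sketch under assumptions: the INI format, the default file name `vidai.conf`, and the helper names `config_paths` and `load_config` are hypothetical and not taken from the repository.

```python
import argparse
import configparser


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", metavar="FILE", default=None,
                        help="load configuration from a custom path")
    return parser.parse_args(argv)


def config_paths(custom_path=None, default_paths=("vidai.conf",)):
    # A custom path, when given, replaces the default search list entirely.
    return [custom_path] if custom_path else list(default_paths)


def load_config(custom_path=None):
    config = configparser.ConfigParser()
    # ConfigParser.read() silently skips files that do not exist.
    config.read(config_paths(custom_path))
    return config
```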

Remove Settings page link from admin navbar · 63965769

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Removed the 'Settings' link from the admin navigation menu
- Settings page route and template still exist but are no longer accessible from navbar
- Admin navbar now shows: Cluster Tokens, Cluster Nodes (no Settings)

63965769

Fix Set Driver modal functionality · 1e53bfb9

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Added workers array to local node API response for modal population
- Fixed table colspan values to match 13 columns
- Removed debug console.log statements
- Modal should now open and show worker driver selection options

1e53bfb9

Fix Set Driver button click handler · 314a1125

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Fixed JavaScript template literal issue preventing button clicks from working
- Changed from inline onclick with template variables to data attributes + event delegation
- Added event listener for .set-driver-btn class buttons
- Buttons now properly read hostname and token from data attributes
- Modal should now open when clicking Set Driver buttons

314a1125

Remove NVIDIA-only GPU filtering, detect all working CUDA/ROCm GPUs · 13ffc88e

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Removed brand-specific filtering that only allowed NVIDIA GPUs
- Now detects any GPU that can actually perform CUDA or ROCm operations
- Functional test determines if GPU should be included, not brand
- GPUs are shown with correct system indices (Device 0, 1, etc.)
- AMD GPUs that support ROCm will be shown if functional
- CUDA GPUs from any vendor will be shown if functional

13ffc88e

Fix GPU VRAM detection to use correct method from /api/stats · efbb77ce

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Updated GPU VRAM detection to use torch.cuda.get_device_properties(i).total_memory / 1024**3
- Same method as used in /api/stats endpoint for consistency
- Still filters out non-NVIDIA and non-functional GPUs
- Now shows correct VRAM amounts (e.g., 24GB for RTX 3090 instead of hardcoded 8GB)
- Fixed both worker-level and node-level GPU detection

efbb77ce
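
The VRAM calculation named in this commit can be sketched as below. The `torch.cuda` calls are real PyTorch APIs; the function name `list_cuda_vram_gb` and the returned dictionary layout are assumptions for illustration. The sketch returns an empty list when PyTorch or CUDA is unavailable, and it does not include the NVIDIA filtering or functional tests the commit log mentions.

```python
def list_cuda_vram_gb():
    """Return per-device VRAM in GiB, or [] when PyTorch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    devices = []
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        devices.append({
            "index": i,
            "name": props.name,
            # total_memory is reported in bytes; divide by 1024**3 for GiB.
            "vram_gb": round(props.total_memory / 1024**3, 1),
        })
    return devices
```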

Add debug logging to GPU detection · f91fafcf

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Added debug output to see what CUDA device names are detected
- Will help identify why AMD GPU is still being counted as CUDA device
- Debug output shows device names and functional test results
- User can now see what devices PyTorch is detecting

f91fafcf

Fix GPU detection to only count working, functional GPUs · 056cbbf3

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Modified detect_gpu_backends() to perform functional tests on GPUs
- CUDA detection now verifies devices can actually perform tensor operations
- ROCm detection now tests device functionality before counting
- Only NVIDIA GPUs are counted for CUDA, and only functional devices
- Prevents counting of non-working GPUs like old AMD cards misreported as CUDA
- Example: System with old AMD GPU (device 0) + working CUDA GPU (device 1) now correctly shows only the functional CUDA GPU
- Total VRAM calculation now reflects only actually usable GPUs
- Both PyTorch and nvidia-smi/rocm-smi detection paths updated

056cbbf3

Fix GPU VRAM detection to count only available GPUs · ffe34516

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Modified local node GPU memory calculation to only count GPUs that are actually available for supported backends
- Previously counted all GPUs in system, now only counts CUDA GPUs if CUDA is available and ROCm GPUs if ROCm is available
- Fixes issue where unsupported GPUs (like old AMD GPUs without ROCm support) were incorrectly included in VRAM totals
- Example: System with old AMD GPU (8GB, no ROCm) and CUDA GPU (24GB) now correctly shows 24GB total instead of 32GB
- Ensures accurate GPU resource reporting in cluster nodes interface

ffe34516
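
The 24GB versus 32GB example above amounts to summing VRAM only over GPUs whose backend is usable. A minimal sketch, with a hypothetical data layout and function name:

```python
def usable_vram_gb(gpus, available_backends):
    """Sum VRAM only for GPUs whose backend is actually available."""
    return sum(g["vram_gb"] for g in gpus if g["backend"] in available_backends)


# Hypothetical system from the commit message: an old AMD card without
# ROCm support (8GB) next to a working CUDA card (24GB).
gpus = [
    {"name": "old AMD card", "backend": "rocm", "vram_gb": 8},
    {"name": "CUDA card", "backend": "cuda", "vram_gb": 24},
]

total = usable_vram_gb(gpus, {"cuda"})  # only CUDA is available here
```

With only CUDA available, `total` is 24; counting every GPU in the system would give 32.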

Implement per-worker driver selection modal · 4ca34e75

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Modified modal to show individual GPU-requiring workers on each node
- Allow granular driver selection (CUDA/ROCm/CPU) for each worker subprocess
- Updated database schema to store driver preferences per worker (hostname + token + worker_name)
- Enhanced API to handle per-worker driver setting with form field parsing
- Added restart_client_worker method to cluster master for individual worker restarts
- Frontend now displays worker-specific driver selection controls in modal
- Maintains node-level table view while providing worker-level configuration
- Supports CPU-only nodes and mixed GPU/CPU worker configurations
- Backward compatible with existing single-driver preference system

4ca34e75
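
Storing a driver preference per worker, keyed by hostname + token + worker_name, might look like the sketch below. SQLite, the `driver` column, and the helper names are assumptions; only the table name `client_driver_preferences` and the three key fields appear in the commit log.

```python
import sqlite3


def init_db(conn):
    conn.execute(
        """CREATE TABLE IF NOT EXISTS client_driver_preferences (
               hostname    TEXT NOT NULL,
               token       TEXT NOT NULL,
               worker_name TEXT NOT NULL,
               driver      TEXT NOT NULL,
               PRIMARY KEY (hostname, token, worker_name)
           )"""
    )


def set_driver(conn, hostname, token, worker_name, driver):
    # INSERT OR REPLACE overwrites any existing preference for the same key.
    conn.execute(
        "INSERT OR REPLACE INTO client_driver_preferences "
        "(hostname, token, worker_name, driver) VALUES (?, ?, ?, ?)",
        (hostname, token, worker_name, driver),
    )
    conn.commit()


def get_driver(conn, hostname, token, worker_name, default="cuda"):
    row = conn.execute(
        "SELECT driver FROM client_driver_preferences "
        "WHERE hostname = ? AND token = ? AND worker_name = ?",
        (hostname, token, worker_name),
    ).fetchone()
    return row[0] if row else default
```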

Fix NameError in cluster nodes API · 5cbdab26

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Fixed undefined variable 'local_gpu_backends' in api_cluster_nodes function
- Properly defined local_available_backends, local_gpu_backends, and local_cpu_backends
- Updated local node detection to show nodes with any available backends (GPU or CPU)
- Ensured CPU-only nodes are correctly identified and displayed
- Maintained backward compatibility with existing GPU-only node detection

5cbdab26

Allow CPU-only cluster clients and flexible backend support · bd087af5

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Removed GPU-only requirement for cluster client connections
- CPU-only clients can now join cluster and run CPU-based workers
- Master accepts all clients regardless of GPU availability
- Nodes are properly marked as CPU-only when no GPUs detected
- Driver selection modal supports CUDA, ROCm, and CPU backends
- Local and remote workers can use any available backend (GPU or CPU)
- Enhanced cluster flexibility for mixed hardware environments
- CPU nodes contribute to cluster for CPU-only processing tasks
- Maintains backward compatibility with existing GPU-only workflows
- Clear node type identification in cluster management interface

bd087af5

Enforce GPU-only cluster participation · f57a1468

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Cluster clients now refuse to connect without GPU capabilities (CUDA/ROCm)
- Cluster master rejects authentication from clients without GPU backends
- Local master node only appears in cluster nodes list if GPU backends are available
- Master already prevented launching local worker processes without GPUs
- Systems without GPUs cannot participate in distributed processing
- Clear error messages when GPU requirements are not met
- Maintains cluster integrity by ensuring all nodes contribute computational power

f57a1468

Restrict driver selection to available GPU backends only · abec9e31

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Removed CPU option from driver selection (only CUDA/ROCm GPU drivers)
- Set CUDA as default driver selection when available
- Added available_gpu_backends field to node API responses
- Frontend dynamically populates driver options based on node's available GPUs
- API validation rejects non-GPU driver requests
- Cluster clients only accept CUDA/ROCm backend restart commands
- Improved user experience by showing only relevant driver options per node

abec9e31

Enable dynamic backend switching for cluster clients with mixed GPU support · bedc1de9

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Added restart_workers command from master to clients for backend switching
- Cluster clients can now restart their workers with different backends (CUDA/ROCm/CPU)
- Added mixed GPU detection - nodes with both CUDA and ROCm show 'Mixed GPU Available' indicator
- Clients with mixed GPUs can switch between CUDA and ROCm backends dynamically
- Updated API endpoint to send restart commands to connected clients
- Clients save driver preferences and restart workers immediately when changed
- Graceful fallback to available backends if requested backend not available
- Visual indicator for nodes capable of backend switching

bedc1de9

Enable driver switching for local workers and show master weight · 6b838e4a

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Display actual cluster master weight instead of 'N/A' for local node
- Implement driver switching for local workers via modal popup
- Add switch_local_worker_backends() function to restart workers with new backends
- Update API endpoint to handle local worker driver changes
- Add CPU option to driver selection modal
- Local workers can now switch between CUDA, ROCm, and CPU backends dynamically
- Workers are terminated and restarted with new backend configuration

6b838e4a

Add config file support for cluster master weight with 'auto' mode · fb7ad973

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Added cluster_master_weight config option (default: 'auto')
- Implemented weight precedence: command line > config file > default 'auto'
- 'auto' mode enables automatic weight adjustment (100->0 on first client, 0->100 when all disconnect)
- Explicit numeric weights disable automatic adjustment
- Updated sample config file with cluster_master_weight setting
- Enhanced command line parsing to accept 'auto' or numeric values
- Improved startup messages to indicate weight source and behavior

fb7ad973
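
The precedence and 'auto' behaviour listed in this commit can be sketched as follows. `resolve_weight` and `MasterWeight` are hypothetical names, and starting at 100 in auto mode is inferred from the "100->0 on first client" wording rather than confirmed by the source.

```python
def resolve_weight(cli_value=None, config_value=None):
    """Return (weight, auto) using: command line > config file > default 'auto'."""
    for raw in (cli_value, config_value, "auto"):
        if raw is None:
            continue
        if str(raw).strip().lower() == "auto":
            return 100, True   # assumed starting weight in auto mode
        return int(raw), False  # an explicit number disables auto adjustment


class MasterWeight:
    def __init__(self, cli_value=None, config_value=None):
        self.weight, self.auto = resolve_weight(cli_value, config_value)
        self.clients = 0

    def client_connected(self):
        self.clients += 1
        if self.auto and self.clients == 1:
            self.weight = 0    # first client joined: master steps back

    def client_disconnected(self):
        self.clients = max(0, self.clients - 1)
        if self.auto and self.clients == 0:
            self.weight = 100  # all clients gone: master takes the load again
```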

Make cluster master weight auto-adjustment conditional on explicit setting · 8fedb8dc

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Added weight_explicit flag to track if --weight was specified on command line
- Automatic weight changes (100->0 on first client, 0->100 on last disconnect) only apply when weight is not explicitly set
- When --weight is specified, master maintains the explicit weight regardless of client connections
- Updated command line help and startup messages to clarify the behavior
- This allows administrators to override automatic weight management when needed

8fedb8dc

Refactor cluster nodes display to show nodes instead of individual workers · 711719c4

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Modified API to aggregate workers per node instead of showing each worker separately
- Each cluster node now appears as a single row with summarized worker information
- Workers column shows count and types: '2 workers - Analysis (CUDA), Training (ROCm)'
- Local workers are grouped into a single 'Local Master Node' entry
- Updated frontend to display worker summaries with detailed breakdown
- Updated API documentation to reflect new response format with workers_summary field

711719c4
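
Aggregating workers into one row per node, with a summary string in the format quoted above, could be done roughly like this. Only `workers_summary` is named in the commit log; the input keys and the function name are hypothetical.

```python
from collections import defaultdict


def summarize_nodes(workers):
    """Group worker records by hostname and build one summary row per node."""
    by_node = defaultdict(list)
    for worker in workers:
        by_node[worker["hostname"]].append(worker)

    nodes = []
    for hostname, items in by_node.items():
        parts = ", ".join(f"{w['type']} ({w['backend']})" for w in items)
        label = "worker" if len(items) == 1 else "workers"
        nodes.append({
            "hostname": hostname,
            "workers_summary": f"{len(items)} {label} - {parts}",
        })
    return nodes
```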

Add local worker processes to cluster nodes display · 27e73381

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Detect running local worker processes on cluster master using psutil
- Include local workers in cluster nodes API response with distinct styling
- Show local workers with blue background and 'Local' status indicator
- Display backend information (CUDA/ROCm) in worker names
- Indicate that local workers require manual restart for driver changes
- Update API documentation with local worker response format
- Local workers show N/A for weight since they don't participate in cluster load balancing

27e73381

Add client weight display to cluster nodes page · 1c9ae89a

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Add weight column to cluster nodes table showing load balancing weight
- Set default weights: master=0, clients=100
- Update API response to include client weight
- Update frontend to display weight information
- Update API documentation with weight field

1c9ae89a

Add --cluster-shared-dir option for optimized file transfers · b48679df

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Add --shared-dir argument to cluster_master.py and cluster_client.py
- Implement shared directory file transfer for model files
- Falls back to websocket transfer if shared directory unavailable
- Update cluster client to handle model_shared_file messages
- Add documentation for shared directory feature in architecture.md
- Maintain backward compatibility with existing websocket transfers

b48679df
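
The shared-directory path with websocket fallback might be decided along these lines. The `model_shared_file` message type comes from the commit log; the fallback message type `model_transfer`, the message fields, and the function name are hypothetical placeholders.

```python
import os


def build_model_message(model_name, shared_dir=None):
    """Pick the transfer route: shared directory if usable, else websocket."""
    if shared_dir and os.path.isdir(shared_dir):
        # Both sides can see the directory, so only the path is sent.
        return {
            "type": "model_shared_file",
            "path": os.path.join(shared_dir, model_name),
        }
    # Hypothetical fallback: the file content would be streamed over the websocket.
    return {"type": "model_transfer", "name": model_name}
```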

Enhance cluster nodes page with uptime, job stats, and master statistics · 3c309139

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Add uptime calculation for cluster nodes and master
- Include active/completed job counts per node and totals for master
- Display cluster master statistics before the nodes list
- Update API response format with master_stats and node-level metrics
- Add uptime formatting and job statistics to frontend
- Update API documentation with new response structure

3c309139
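
Uptime formatting of the kind mentioned here is usually a couple of `divmod` steps. The output format below is an assumption, since the commit log does not say how uptime is displayed.

```python
def format_uptime(seconds):
    """Format a duration in seconds as a short human-readable string."""
    seconds = max(0, int(seconds))
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    if days:
        return f"{days}d {hours}h {minutes}m"
    if hours:
        return f"{hours}h {minutes}m"
    if minutes:
        return f"{minutes}m {secs}s"
    return f"{secs}s"
```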

Add admin cluster nodes page with real-time monitoring and driver preferences · 3f496bf6

Stefy Lanza (nextime / spora ) authored Oct 07, 2025

- Add hostname passing from cluster client to master
- Create client_driver_preferences database table for storing driver preferences
- Add /admin/cluster_nodes page with auto-updating node list
- Add API endpoints for fetching nodes and setting driver preferences
- Update admin navbar and API documentation
- Apply database migrations

3f496bf6