1. 07 Oct, 2025 · 40 commits
    • Fix GPU VRAM detection for cluster clients · 1da6025d
      Stefy Lanza (nextime / spora ) authored
      - Update detect_gpu_backends to collect actual VRAM for each GPU device
      - Store device info including VRAM in gpu_info sent to master
      - Use real VRAM data in cluster nodes API instead of hardcoded values
      - Ensure consistent VRAM reporting between master and clients
    • Remove Status column from cluster nodes table · 97a13987
      Stefy Lanza (nextime / spora ) authored
      - Use green background color to indicate connected nodes
      - Remove status text column for cleaner interface
      - Update colspan values for table messages
    • Update cluster nodes API to read from database · eb1870d9
      Stefy Lanza (nextime / spora ) authored
      - Store cluster client info in database for persistence
      - Update API to read connected clients from database
      - Maintain compatibility with existing web interface
    • Fix method call in cluster master register processes · 4b98cda4
      Stefy Lanza (nextime / spora ) authored
      - Use _get_client_by_websocket instead of non-existent _get_client_by_socket
      - Fixes client connection error during process registration
    • Remove /cluster path from websocket URI · 0fc8c705
      Stefy Lanza (nextime / spora ) authored
      - Client connects to wss://host:port instead of wss://host:port/cluster
      - Fixes connection loop issue
    • Add reconnection logic to cluster client · d87215b6
      Stefy Lanza (nextime / spora ) authored
      - Client now attempts to reconnect if connection is lost
      - Prevents processes from being restarted on reconnection
      - Maintains persistent cluster node operation
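The reconnection behaviour described in this commit could look roughly like the loop below. This is a minimal sketch, not the project's code: `run_with_reconnect`, `connect`, and the backoff parameters are hypothetical names, and a real client would open its websocket inside `connect`. Worker processes are assumed to be started once, outside the loop, which is one way to keep a reconnect from restarting them.

```python
import asyncio


async def run_with_reconnect(connect, max_attempts=5, base_delay=0.01, max_delay=1.0):
    """Call `connect()` until it returns normally, retrying on ConnectionError.

    `connect` is a hypothetical coroutine function that opens the connection
    to the master and serves it until it closes. Worker processes are assumed
    to be started once, outside this loop, so a reconnect never restarts them.
    Returns the number of attempts used.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            await connect()
            return attempt
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
            await asyncio.sleep(delay)
            delay = min(delay * 2, max_delay)  # exponential backoff, capped


def demo():
    """Simulate a link that drops twice before a session succeeds."""
    calls = {"n": 0}

    async def flaky_connect():
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("connection lost")

    return asyncio.run(run_with_reconnect(flaky_connect))
```

A long-running node would probably also reconnect after a clean close; the sketch stops there only to keep the example finite.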
    • Fix cluster client process registration dict comprehension · b0c7da40
      Stefy Lanza (nextime / spora ) authored
      - Correct the dict comprehension for registering processes with master
      - Fix duplicate entries and incorrect model assignment
      - Apply same fix to restart workers function
    • Start local backend in cluster client mode · d3ac0046
      Stefy Lanza (nextime / spora ) authored
      - Workers need a local backend to connect to even in client mode
      - Add backend startup and readiness check for cluster clients
      - Ensure proper cleanup on exit
    • Add cluster SSL certificates to .gitignore · 0bb73422
      Stefy Lanza (nextime / spora ) authored
      - Ignore cluster.crt and cluster.key generated certificates
      - Remove committed certificates from repository
    • Fix websockets handler signature for newer websockets version · 8d471b19
      Stefy Lanza (nextime / spora ) authored
      - Remove 'path' parameter from _handle_client method
      - Compatible with websockets 12+ which removed the path argument
    • Integrate secure websocket cluster master into main vidai.py · 81b440d2
      Stefy Lanza (nextime / spora ) authored
      - Modify ClusterMaster to accept host parameter
      - Start cluster master in vidai.py when running as master
      - Use --cluster-host and --cluster-port for websocket server binding
      - Default to 0.0.0.0:5003 for cluster master
    • Fix variable name conflict in admin config · 772f6213
      Stefy Lanza (nextime / spora ) authored
      - Rename 'flash' variable to 'flash_enabled' to avoid shadowing the flash() function
      - Resolve TypeError when saving admin configuration
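The failure mode behind this fix can be reproduced in a few lines. The sketch below is illustrative only: `flash()` is a stand-in for the real flash function, and the `"flash"` form field and both `save_config_*` functions are hypothetical.

```python
def flash(message):
    """Stand-in for the web framework's flash(); here it just returns the message."""
    return message


def save_config_buggy(form):
    # Assigning to `flash` makes it a local name for the whole function,
    # so the later call hits the bool instead of the flash() function.
    flash = form.get("flash") == "on"
    return flash("Configuration saved")  # TypeError: 'bool' object is not callable


def save_config_fixed(form):
    flash_enabled = form.get("flash") == "on"  # no longer shadows flash()
    return flash(f"Configuration saved (flash={flash_enabled})")
```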
    • Remove backend selection from admin config AI Settings · 2c7ae960
      Stefy Lanza (nextime / spora ) authored
      - Remove analysis_backend and training_backend fields from /admin/config
      - These are now configured per worker in the cluster nodes interface
      - Clean up unused imports and form processing
    • Fix missing imports in admin config page · 02b5e4e9
      Stefy Lanza (nextime / spora ) authored
      - Add missing set_* function imports to admin.py config route
      - Resolve NameError when saving admin configuration
    • Enlarge cluster nodes page container to 95% width · a55af666
      Stefy Lanza (nextime / spora ) authored
      - Change container max-width to 95% for better use of screen space
      - Maintain centered layout for the cluster nodes table
    • Fix cluster nodes modal to show per-worker driver selects · 044c2cf2
      Stefy Lanza (nextime / spora ) authored
      - Show individual select forms for every worker in the driver modal
      - Update API to handle per-worker driver selection for local nodes
      - Maintain compatibility with existing backend switching logic
    • Add --config CLI argument and fix cluster nodes driver selection · a7d2d90e
      Stefy Lanza (nextime / spora ) authored
      - Add --config <file> argument to load config from custom path
      - Modify config loader to use custom config file if specified
      - Fix cluster nodes interface to only show available GPU backends for workers
      - Differentiate between local and remote node driver selection
    • Remove Settings page link from admin navbar · 63965769
      Stefy Lanza (nextime / spora ) authored
      - Removed the 'Settings' link from the admin navigation menu
      - Settings page route and template still exist but are no longer accessible from navbar
      - Admin navbar now shows: Cluster Tokens, Cluster Nodes (no Settings)
    • Fix Set Driver modal functionality · 1e53bfb9
      Stefy Lanza (nextime / spora ) authored
      - Added workers array to local node API response for modal population
      - Fixed table colspan values to match 13 columns
      - Removed debug console.log statements
      - Modal should now open and show worker driver selection options
    • Fix Set Driver button click handler · 314a1125
      Stefy Lanza (nextime / spora ) authored
      - Fixed JavaScript template literal issue preventing button clicks from working
      - Changed from inline onclick with template variables to data attributes + event delegation
      - Added event listener for .set-driver-btn class buttons
      - Buttons now properly read hostname and token from data attributes
      - Modal should now open when clicking Set Driver buttons
    • Remove NVIDIA-only GPU filtering, detect all working CUDA/ROCm GPUs · 13ffc88e
      Stefy Lanza (nextime / spora ) authored
      - Removed brand-specific filtering that only allowed NVIDIA GPUs
      - Now detects any GPU that can actually perform CUDA or ROCm operations
      - Functional test determines if GPU should be included, not brand
      - GPUs are shown with correct system indices (Device 0, 1, etc.)
      - AMD GPUs that support ROCm will be shown if functional
      - CUDA GPUs from any vendor will be shown if functional
    • Fix GPU VRAM detection to use correct method from /api/stats · efbb77ce
      Stefy Lanza (nextime / spora ) authored
      - Updated GPU VRAM detection to use torch.cuda.get_device_properties(i).total_memory / 1024**3
      - Same method as used in /api/stats endpoint for consistency
      - Still filters out non-NVIDIA and non-functional GPUs
      - Now shows correct VRAM amounts (e.g., 24GB for RTX 3090 instead of hardcoded 8GB)
      - Fixed both worker-level and node-level GPU detection
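The VRAM expression quoted in this commit, `torch.cuda.get_device_properties(i).total_memory / 1024**3`, can be wrapped as below. This is a sketch under assumptions: the function names are hypothetical, and the PyTorch import is guarded so the example also runs on machines without PyTorch or without a GPU.

```python
def bytes_to_gib(total_bytes):
    """Convert a byte count to GiB, as in total_memory / 1024**3."""
    return total_bytes / 1024**3


def cuda_vram_gib():
    """Return {device_index: VRAM in GiB} for devices PyTorch reports.

    Returns an empty dict when PyTorch is not installed or no CUDA/ROCm
    device is visible, so the sketch also runs on CPU-only machines.
    """
    try:
        import torch
    except ImportError:
        return {}
    if not torch.cuda.is_available():
        return {}
    return {
        i: bytes_to_gib(torch.cuda.get_device_properties(i).total_memory)
        for i in range(torch.cuda.device_count())
    }
```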
    • Add debug logging to GPU detection · f91fafcf
      Stefy Lanza (nextime / spora ) authored
      - Added debug output to see what CUDA device names are detected
      - Will help identify why AMD GPU is still being counted as CUDA device
      - Debug output shows device names and functional test results
      - User can now see what devices PyTorch is detecting
    • Fix GPU detection to only count working, functional GPUs · 056cbbf3
      Stefy Lanza (nextime / spora ) authored
      - Modified detect_gpu_backends() to perform functional tests on GPUs
      - CUDA detection now verifies devices can actually perform tensor operations
      - ROCm detection now tests device functionality before counting
      - Only NVIDIA GPUs are counted for CUDA, and only functional devices
      - Prevents counting of non-working GPUs like old AMD cards misreported as CUDA
      - Example: System with old AMD GPU (device 0) + working CUDA GPU (device 1) now correctly shows only the functional CUDA GPU
      - Total VRAM calculation now reflects only actually usable GPUs
      - Both PyTorch and nvidia-smi/rocm-smi detection paths updated
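A functional test of the kind described here might be structured as follows. This is not the body of `detect_gpu_backends()`, which the log does not show; `functional_devices`, `probe`, and `torch_probe` are hypothetical names, and the probe is injected so the filtering logic can be exercised without a GPU.

```python
def functional_devices(device_count, probe):
    """Return indices of devices for which `probe(index)` succeeds.

    `probe` is a hypothetical callable that runs a tiny workload on the
    device and raises on failure. The original system index is kept, so a
    broken device 0 does not renumber a working device 1.
    """
    working = []
    for index in range(device_count):
        try:
            probe(index)
        except Exception:
            continue  # device is listed by the driver but cannot compute
        working.append(index)
    return working


def torch_probe(index):
    """Example probe: a small tensor addition on the given CUDA/ROCm device."""
    import torch  # imported lazily so the module loads without PyTorch

    x = torch.ones(8, device=f"cuda:{index}")
    if float((x + x).sum().item()) != 16.0:
        raise RuntimeError(f"device {index} returned a wrong result")
```

With PyTorch installed, `functional_devices(torch.cuda.device_count(), torch_probe)` would list only the devices that completed the addition.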
    • Fix GPU VRAM detection to count only available GPUs · ffe34516
      Stefy Lanza (nextime / spora ) authored
      - Modified local node GPU memory calculation to only count GPUs that are actually available for supported backends
      - Previously counted all GPUs in system, now only counts CUDA GPUs if CUDA is available and ROCm GPUs if ROCm is available
      - Fixes issue where unsupported GPUs (like old AMD GPUs without ROCm support) were incorrectly included in VRAM totals
      - Example: System with old AMD GPU (8GB, no ROCm) and CUDA GPU (24GB) now correctly shows 24GB total instead of 32GB
      - Ensures accurate GPU resource reporting in cluster nodes interface
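The 24GB versus 32GB example in this commit can be worked through in code. The sketch assumes a hypothetical record shape (`"backend"` and `"vram_gb"` keys); the real data structures are not shown in the log.

```python
def usable_vram_gb(gpus, available_backends):
    """Sum VRAM over GPUs whose backend is actually available.

    `gpus` is a list of dicts with hypothetical keys "backend" and "vram_gb".
    """
    return sum(g["vram_gb"] for g in gpus if g["backend"] in available_backends)


gpus = [
    {"name": "old AMD card", "backend": "rocm", "vram_gb": 8},
    {"name": "CUDA card", "backend": "cuda", "vram_gb": 24},
]
# ROCm is not available on this system, so only the CUDA card is counted.
total = usable_vram_gb(gpus, available_backends={"cuda"})
```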
    • Implement per-worker driver selection modal · 4ca34e75
      Stefy Lanza (nextime / spora ) authored
      - Modified modal to show individual GPU-requiring workers on each node
      - Allow granular driver selection (CUDA/ROCm/CPU) for each worker subprocess
      - Updated database schema to store driver preferences per worker (hostname + token + worker_name)
      - Enhanced API to handle per-worker driver setting with form field parsing
      - Added restart_client_worker method to cluster master for individual worker restarts
      - Frontend now displays worker-specific driver selection controls in modal
      - Maintains node-level table view while providing worker-level configuration
      - Supports CPU-only nodes and mixed GPU/CPU worker configurations
      - Backward compatible with existing single-driver preference system
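Keying preferences on hostname + token + worker_name, as this commit describes, could be modelled with a composite primary key. The sketch below is an assumption throughout: the table name is borrowed from an earlier commit in this log (`client_driver_preferences`), while the column names, the `driver` values, and the use of SQLite are illustrative guesses.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS client_driver_preferences (
    hostname    TEXT NOT NULL,
    token       TEXT NOT NULL,
    worker_name TEXT NOT NULL,
    driver      TEXT NOT NULL,
    PRIMARY KEY (hostname, token, worker_name)
)
"""


def set_driver(conn, hostname, token, worker_name, driver):
    """Insert or overwrite the driver preference for one worker."""
    conn.execute(
        "INSERT OR REPLACE INTO client_driver_preferences VALUES (?, ?, ?, ?)",
        (hostname, token, worker_name, driver),
    )


def get_driver(conn, hostname, token, worker_name, default="cuda"):
    """Return the stored driver for a worker, or `default` if none is stored."""
    row = conn.execute(
        "SELECT driver FROM client_driver_preferences "
        "WHERE hostname = ? AND token = ? AND worker_name = ?",
        (hostname, token, worker_name),
    ).fetchone()
    return row[0] if row else default


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
set_driver(conn, "node1", "tok", "analysis", "cuda")
set_driver(conn, "node1", "tok", "training", "rocm")
set_driver(conn, "node1", "tok", "analysis", "cpu")  # overwrites the first row
```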
    • Fix NameError in cluster nodes API · 5cbdab26
      Stefy Lanza (nextime / spora ) authored
      - Fixed undefined variable 'local_gpu_backends' in api_cluster_nodes function
      - Properly defined local_available_backends, local_gpu_backends, and local_cpu_backends
      - Updated local node detection to show nodes with any available backends (GPU or CPU)
      - Ensured CPU-only nodes are correctly identified and displayed
      - Maintained backward compatibility with existing GPU-only node detection
    • Allow CPU-only cluster clients and flexible backend support · bd087af5
      Stefy Lanza (nextime / spora ) authored
      - Removed GPU-only requirement for cluster client connections
      - CPU-only clients can now join cluster and run CPU-based workers
      - Master accepts all clients regardless of GPU availability
      - Nodes are properly marked as CPU-only when no GPUs detected
      - Driver selection modal supports CUDA, ROCm, and CPU backends
      - Local and remote workers can use any available backend (GPU or CPU)
      - Enhanced cluster flexibility for mixed hardware environments
      - CPU nodes contribute to cluster for CPU-only processing tasks
      - Maintains backward compatibility with existing GPU-only workflows
      - Clear node type identification in cluster management interface
    • Enforce GPU-only cluster participation · f57a1468
      Stefy Lanza (nextime / spora ) authored
      - Cluster clients now refuse to connect without GPU capabilities (CUDA/ROCm)
      - Cluster master rejects authentication from clients without GPU backends
      - Local master node only appears in cluster nodes list if GPU backends are available
      - Master already prevented launching local worker processes without GPUs
      - Systems without GPUs cannot participate in distributed processing
      - Clear error messages when GPU requirements are not met
      - Maintains cluster integrity by ensuring all nodes contribute computational power
    • Restrict driver selection to available GPU backends only · abec9e31
      Stefy Lanza (nextime / spora ) authored
      - Removed CPU option from driver selection (only CUDA/ROCm GPU drivers)
      - Set CUDA as default driver selection when available
      - Added available_gpu_backends field to node API responses
      - Frontend dynamically populates driver options based on node's available GPUs
      - API validation rejects non-GPU driver requests
      - Cluster clients only accept CUDA/ROCm backend restart commands
      - Improved user experience by showing only relevant driver options per node
    • Enable dynamic backend switching for cluster clients with mixed GPU support · bedc1de9
      Stefy Lanza (nextime / spora ) authored
      - Added restart_workers command from master to clients for backend switching
      - Cluster clients can now restart their workers with different backends (CUDA/ROCm/CPU)
      - Added mixed GPU detection - nodes with both CUDA and ROCm show 'Mixed GPU Available' indicator
      - Clients with mixed GPUs can switch between CUDA and ROCm backends dynamically
      - Updated API endpoint to send restart commands to connected clients
      - Clients save driver preferences and restart workers immediately when changed
      - Graceful fallback to available backends if requested backend not available
      - Visual indicator for nodes capable of backend switching
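The graceful fallback and the mixed-GPU check mentioned in this commit might reduce to two small helpers. Both function names are hypothetical, and the fallback order (CUDA, then ROCm, then CPU) is an assumption; the log does not state which backend is preferred when the requested one is missing.

```python
def choose_backend(requested, available, preference=("cuda", "rocm", "cpu")):
    """Return `requested` if the node offers it, else the first available
    backend in `preference` order. Raises ValueError if nothing matches."""
    if requested in available:
        return requested
    for backend in preference:
        if backend in available:
            return backend
    raise ValueError("no usable backend on this node")


def is_mixed_gpu(available):
    """A node is 'mixed' when it offers both CUDA and ROCm."""
    return {"cuda", "rocm"} <= set(available)
```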
    • Enable driver switching for local workers and show master weight · 6b838e4a
      Stefy Lanza (nextime / spora ) authored
      - Display actual cluster master weight instead of 'N/A' for local node
      - Implement driver switching for local workers via modal popup
      - Add switch_local_worker_backends() function to restart workers with new backends
      - Update API endpoint to handle local worker driver changes
      - Add CPU option to driver selection modal
      - Local workers can now switch between CUDA, ROCm, and CPU backends dynamically
      - Workers are terminated and restarted with new backend configuration
    • Add config file support for cluster master weight with 'auto' mode · fb7ad973
      Stefy Lanza (nextime / spora ) authored
      - Added cluster_master_weight config option (default: 'auto')
      - Implemented weight precedence: command line > config file > default 'auto'
      - 'auto' mode enables automatic weight adjustment (100->0 on first client, 0->100 when all disconnect)
      - Explicit numeric weights disable automatic adjustment
      - Updated sample config file with cluster_master_weight setting
      - Enhanced command line parsing to accept 'auto' or numeric values
      - Improved startup messages to indicate weight source and behavior
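The precedence rule stated in this commit (command line > config file > default 'auto') can be sketched as a small resolver. `parse_weight` and `resolve_master_weight` are hypothetical names; only the precedence order and the 'auto' keyword come from the commit message.

```python
def parse_weight(value):
    """Accept 'auto' (case-insensitive) or an integer weight."""
    if isinstance(value, str) and value.strip().lower() == "auto":
        return "auto"
    return int(value)


def resolve_master_weight(cli_value=None, config_value=None):
    """Apply the precedence: command line > config file > default 'auto'.

    Returns (weight, source) so a startup message can report where the
    value came from.
    """
    if cli_value is not None:
        return parse_weight(cli_value), "command line"
    if config_value is not None:
        return parse_weight(config_value), "config file"
    return "auto", "default"
```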
    • Make cluster master weight auto-adjustment conditional on explicit setting · 8fedb8dc
      Stefy Lanza (nextime / spora ) authored
      - Added weight_explicit flag to track if --weight was specified on command line
      - Automatic weight changes (100->0 on first client, 0->100 on last disconnect) only apply when weight is not explicitly set
      - When --weight is specified, master maintains the explicit weight regardless of client connections
      - Updated command line help and startup messages to clarify the behavior
      - This allows administrators to override automatic weight management when needed
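The conditional adjustment described here could be tracked with a small state object. The `weight_explicit` flag is named in the commit; the `MasterWeight` class and its method names are hypothetical, and the sketch assumes the master starts at weight 100 when no explicit weight is given.

```python
class MasterWeight:
    """Tracks the master's weight as clients connect and disconnect."""

    def __init__(self, explicit_weight=None):
        # weight_explicit mirrors the flag named in the commit message.
        self.weight_explicit = explicit_weight is not None
        self.weight = explicit_weight if self.weight_explicit else 100
        self.clients = 0

    def client_connected(self):
        self.clients += 1
        if not self.weight_explicit and self.clients == 1:
            self.weight = 0  # first client connected: 100 -> 0

    def client_disconnected(self):
        self.clients = max(0, self.clients - 1)
        if not self.weight_explicit and self.clients == 0:
            self.weight = 100  # last client disconnected: 0 -> 100
```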
    • Refactor cluster nodes display to show nodes instead of individual workers · 711719c4
      Stefy Lanza (nextime / spora ) authored
      - Modified API to aggregate workers per node instead of showing each worker separately
      - Each cluster node now appears as a single row with summarized worker information
      - Workers column shows count and types: '2 workers - Analysis (CUDA), Training (ROCm)'
      - Local workers are grouped into a single 'Local Master Node' entry
      - Updated frontend to display worker summaries with detailed breakdown
      - Updated API documentation to reflect new response format with workers_summary field
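The per-node aggregation and the summary string quoted in this commit can be illustrated as follows. The input shape (tuples of hostname, worker type, backend) and both function names are assumptions made for the sketch; only the output format '2 workers - Analysis (CUDA), Training (ROCm)' comes from the commit message.

```python
def summarize_node(workers):
    """Collapse a node's workers into one summary string.

    `workers` is a list of (worker_type, backend) tuples, a hypothetical
    shape chosen for this sketch.
    """
    count = len(workers)
    noun = "worker" if count == 1 else "workers"
    details = ", ".join(f"{kind} ({backend})" for kind, backend in workers)
    return f"{count} {noun} - {details}" if details else f"0 {noun}"


def group_by_node(rows):
    """Group (hostname, worker_type, backend) rows into one entry per node."""
    nodes = {}
    for hostname, kind, backend in rows:
        nodes.setdefault(hostname, []).append((kind, backend))
    return {host: summarize_node(ws) for host, ws in nodes.items()}
```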
    • Add local worker processes to cluster nodes display · 27e73381
      Stefy Lanza (nextime / spora ) authored
      - Detect running local worker processes on cluster master using psutil
      - Include local workers in cluster nodes API response with distinct styling
      - Show local workers with blue background and 'Local' status indicator
      - Display backend information (CUDA/ROCm) in worker names
      - Indicate that local workers require manual restart for driver changes
      - Update API documentation with local worker response format
      - Local workers show N/A for weight since they don't participate in cluster load balancing
    • Add client weight display to cluster nodes page · 1c9ae89a
      Stefy Lanza (nextime / spora ) authored
      - Add weight column to cluster nodes table showing load balancing weight
      - Set default weights: master=0, clients=100
      - Update API response to include client weight
      - Update frontend to display weight information
      - Update API documentation with weight field
    • Add --cluster-shared-dir option for optimized file transfers · b48679df
      Stefy Lanza (nextime / spora ) authored
      - Add --shared-dir argument to cluster_master.py and cluster_client.py
      - Implement shared directory file transfer for model files
      - Falls back to websocket transfer if shared directory unavailable
      - Update cluster client to handle model_shared_file messages
      - Add documentation for shared directory feature in architecture.md
      - Maintain backward compatibility with existing websocket transfers
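The shared-directory path with websocket fallback might be decided on the sending side roughly as below. Only the `model_shared_file` message type appears in the commit; the fallback type `"model_file"`, the field names, the base64 encoding, and the function name are all assumptions for illustration.

```python
import base64
import os
import shutil


def build_model_message(model_path, shared_dir=None):
    """Prefer the shared directory; otherwise embed the file in the message.

    "model_shared_file" is the message type named in the commit. The
    fallback type "model_file" and all field names are assumptions.
    """
    name = os.path.basename(model_path)
    if shared_dir and os.path.isdir(shared_dir) and os.access(shared_dir, os.W_OK):
        # Shared storage is reachable: copy once and send only the file name.
        shutil.copy2(model_path, os.path.join(shared_dir, name))
        return {"type": "model_shared_file", "filename": name}
    # No usable shared directory: fall back to sending the bytes inline.
    with open(model_path, "rb") as fh:
        payload = base64.b64encode(fh.read()).decode("ascii")
    return {"type": "model_file", "filename": name, "data": payload}
```

A receiving client would then read the named file from its own mount of the shared directory, or decode the inline payload when the fallback message arrives.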
    • Enhance cluster nodes page with uptime, job stats, and master statistics · 3c309139
      Stefy Lanza (nextime / spora ) authored
      - Add uptime calculation for cluster nodes and master
      - Include active/completed job counts per node and totals for master
      - Display cluster master statistics before the nodes list
      - Update API response format with master_stats and node-level metrics
      - Add uptime formatting and job statistics to frontend
      - Update API documentation with new response structure
    • Add admin cluster nodes page with real-time monitoring and driver preferences · 3f496bf6
      Stefy Lanza (nextime / spora ) authored
      - Add hostname passing from cluster client to master
      - Create client_driver_preferences database table for storing driver preferences
      - Add /admin/cluster_nodes page with auto-updating node list
      - Add API endpoints for fetching nodes and setting driver preferences
      - Update admin navbar and API documentation
      - Apply database migrations