07 Oct, 2025 - 40 commits

Stefy Lanza (nextime / spora ) authored
- Store cluster client info in database for persistence
- Update API to read connected clients from database
- Maintain compatibility with existing web interface

Stefy Lanza (nextime / spora ) authored
- Use _get_client_by_websocket instead of the non-existent _get_client_by_socket
- Fixes client connection error during process registration

Stefy Lanza (nextime / spora ) authored
- Client connects to wss://host:port instead of wss://host:port/cluster
- Fixes connection loop issue

Stefy Lanza (nextime / spora ) authored
- Client now attempts to reconnect if the connection is lost (sketch below)
- Prevents processes from being restarted on reconnection
- Maintains persistent cluster node operation

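A minimal sketch of the reconnect behaviour this commit describes, assuming the `websockets` package; the URI, back-off values and the `handle_session` helper are illustrative, not the project's actual code.

```python
import asyncio
import websockets

MASTER_URI = "wss://master.example:5003"  # hypothetical master address

async def run_client():
    """Keep the cluster client connected, reconnecting on failure.

    Workers are started once, outside this loop, so a reconnect does
    not restart the worker processes.
    """
    backoff = 1
    while True:
        try:
            async with websockets.connect(MASTER_URI) as ws:
                backoff = 1  # reset after a successful connection
                await handle_session(ws)
        except (OSError, websockets.ConnectionClosed):
            # Connection lost or refused: wait, then try again.
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, 30)

async def handle_session(ws):
    # Placeholder: authenticate, register processes, then relay messages.
    async for message in ws:
        pass

if __name__ == "__main__":
    asyncio.run(run_client())
```
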
Stefy Lanza (nextime / spora ) authored
- Correct the dict comprehension for registering processes with master
- Fix duplicate entries and incorrect model assignment
- Apply same fix to restart workers function

Stefy Lanza (nextime / spora ) authored
- Workers need a local backend to connect to even in client mode
- Add backend startup and readiness check for cluster clients
- Ensure proper cleanup on exit

Stefy Lanza (nextime / spora ) authored
- Ignore cluster.crt and cluster.key generated certificates
- Remove committed certificates from repository

Stefy Lanza (nextime / spora ) authored
- Remove 'path' parameter from _handle_client method (sketch below)
- Compatible with websockets 12+ which removed the path argument

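Per this commit, websockets 12+ no longer passes a `path` argument to connection handlers, so the handler takes only the connection object. A hedged sketch of the adjusted signature; the `ClusterMaster` wiring shown here is illustrative, not the project's actual class.

```python
import asyncio
import websockets

class ClusterMaster:
    # New-style handler: no second `path` parameter.
    async def _handle_client(self, websocket):
        async for message in websocket:
            await websocket.send(message)  # placeholder echo logic

    async def serve(self, host="0.0.0.0", port=5003):
        async with websockets.serve(self._handle_client, host, port):
            await asyncio.Future()  # run until cancelled

if __name__ == "__main__":
    asyncio.run(ClusterMaster().serve())
```
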
Stefy Lanza (nextime / spora ) authored
- Modify ClusterMaster to accept host parameter
- Start cluster master in vidai.py when running as master
- Use --cluster-host and --cluster-port for websocket server binding
- Default to 0.0.0.0:5003 for cluster master

Stefy Lanza (nextime / spora ) authored
- Rename 'flash' variable to 'flash_enabled' to avoid shadowing the flash() function (sketch below)
- Resolve TypeError when saving admin configuration

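The shadowing problem looks roughly like this in Flask: a local variable named `flash` hides the imported `flash()` helper, so the later call raises a TypeError. A hedged sketch of the rename; the route and form field name are assumptions.

```python
from flask import Flask, flash, redirect, request

app = Flask(__name__)
app.secret_key = "dev"  # required by flash(); value is illustrative

@app.route("/admin/config", methods=["GET", "POST"])
def admin_config():
    if request.method == "POST":
        # Renamed from `flash` to `flash_enabled` so the imported
        # flash() helper is not shadowed by a local bool.
        flash_enabled = request.form.get("flash", "off") == "on"
        # ... persist flash_enabled with the rest of the settings ...
        flash("Configuration saved.")
        return redirect("/admin/config")
    return "config form goes here"
```
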
Stefy Lanza (nextime / spora ) authored
- Remove analysis_backend and training_backend fields from /admin/config
- These are now configured per worker in the cluster nodes interface
- Clean up unused imports and form processing

Stefy Lanza (nextime / spora ) authored
- Add missing set_* function imports to admin.py config route
- Resolve NameError when saving admin configuration

Stefy Lanza (nextime / spora ) authored
- Change container max-width to 95% for better use of screen space
- Maintain centered layout for the cluster nodes table

Stefy Lanza (nextime / spora ) authored
- Show individual select forms for every worker in the driver modal
- Update API to handle per-worker driver selection for local nodes
- Maintain compatibility with existing backend switching logic

Stefy Lanza (nextime / spora ) authored
- Add --config <file> argument to load config from custom path (sketch below)
- Modify config loader to use custom config file if specified
- Fix cluster nodes interface to only show available GPU backends for workers
- Differentiate between local and remote node driver selection

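A hedged sketch of threading a custom config path through the loader; the default path, the `load_config` helper and the INI format are assumptions rather than the project's actual implementation.

```python
import argparse
import configparser

DEFAULT_CONFIG = "/etc/vidai/vidai.conf"  # hypothetical default path

def load_config(path=None):
    """Load settings from `path` if given, otherwise from the default location."""
    parser = configparser.ConfigParser()
    parser.read(path or DEFAULT_CONFIG)
    return parser

def parse_args():
    ap = argparse.ArgumentParser()
    ap.add_argument("--config", metavar="FILE",
                    help="load configuration from a custom path")
    return ap.parse_args()

if __name__ == "__main__":
    args = parse_args()
    config = load_config(args.config)
    print(config.sections())
```
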
Stefy Lanza (nextime / spora ) authored
- Removed the 'Settings' link from the admin navigation menu
- Settings page route and template still exist but are no longer accessible from navbar
- Admin navbar now shows: Cluster Tokens, Cluster Nodes (no Settings)

Stefy Lanza (nextime / spora ) authored
- Added workers array to local node API response for modal population
- Fixed table colspan values to match 13 columns
- Removed debug console.log statements
- Modal should now open and show worker driver selection options

Stefy Lanza (nextime / spora ) authored
- Fixed JavaScript template literal issue preventing button clicks from working
- Changed from inline onclick with template variables to data attributes + event delegation
- Added event listener for .set-driver-btn class buttons
- Buttons now properly read hostname and token from data attributes
- Modal should now open when clicking Set Driver buttons

Stefy Lanza (nextime / spora ) authored
- Removed brand-specific filtering that only allowed NVIDIA GPUs
- Now detects any GPU that can actually perform CUDA or ROCm operations
- Functional test determines if GPU should be included, not brand
- GPUs are shown with correct system indices (Device 0, 1, etc.)
- AMD GPUs that support ROCm will be shown if functional
- CUDA GPUs from any vendor will be shown if functional

Stefy Lanza (nextime / spora ) authored
- Updated GPU VRAM detection to use torch.cuda.get_device_properties(i).total_memory / 1024**3 (sketch below)
- Same method as used in /api/stats endpoint for consistency
- Still filters out non-NVIDIA and non-functional GPUs
- Now shows correct VRAM amounts (e.g., 24GB for RTX 3090 instead of hardcoded 8GB)
- Fixed both worker-level and node-level GPU detection

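The commit names the exact conversion, so here is a minimal sketch of the per-device VRAM calculation (bytes to GiB); it assumes a CUDA- or ROCm-enabled PyTorch build and the return format is illustrative.

```python
import torch

def gpu_vram_gb():
    """Return per-device VRAM in GiB, matching the /api/stats calculation."""
    if not torch.cuda.is_available():
        return {}
    return {
        i: torch.cuda.get_device_properties(i).total_memory / 1024**3
        for i in range(torch.cuda.device_count())
    }

if __name__ == "__main__":
    for idx, vram in gpu_vram_gb().items():
        print(f"Device {idx}: {vram:.1f} GB")
```
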
Stefy Lanza (nextime / spora ) authored
- Added debug output to see what CUDA device names are detected
- Will help identify why AMD GPU is still being counted as CUDA device
- Debug output shows device names and functional test results
- User can now see what devices PyTorch is detecting

Stefy Lanza (nextime / spora ) authored
- Modified detect_gpu_backends() to perform functional tests on GPUs (sketch below)
- CUDA detection now verifies devices can actually perform tensor operations
- ROCm detection now tests device functionality before counting
- Only NVIDIA GPUs are counted for CUDA, and only functional devices
- Prevents counting of non-working GPUs like old AMD cards misreported as CUDA
- Example: System with old AMD GPU (device 0) + working CUDA GPU (device 1) now correctly shows only the functional CUDA GPU
- Total VRAM calculation now reflects only actually usable GPUs
- Both PyTorch and nvidia-smi/rocm-smi detection paths updated

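A hedged sketch of the functional-test idea behind detect_gpu_backends(): run a small tensor operation on each device and count only the ones that succeed. The return format is an assumption, not the project's actual signature.

```python
import torch

def detect_gpu_backends():
    """Count only devices that can actually run a tensor operation."""
    functional = []
    if torch.cuda.is_available():  # True for both CUDA and ROCm builds
        for i in range(torch.cuda.device_count()):
            try:
                x = torch.ones(8, device=f"cuda:{i}")
                (x * 2).sum().item()  # fails on broken/misreported devices
                functional.append(i)
            except RuntimeError:
                continue
    backend = "rocm" if torch.version.hip else "cuda"
    return {backend: len(functional), "devices": functional}

if __name__ == "__main__":
    print(detect_gpu_backends())
```
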
Stefy Lanza (nextime / spora ) authored
- Modified local node GPU memory calculation to only count GPUs that are actually available for supported backends
- Previously counted all GPUs in system, now only counts CUDA GPUs if CUDA is available and ROCm GPUs if ROCm is available
- Fixes issue where unsupported GPUs (like old AMD GPUs without ROCm support) were incorrectly included in VRAM totals
- Example: System with old AMD GPU (8GB, no ROCm) and CUDA GPU (24GB) now correctly shows 24GB total instead of 32GB
- Ensures accurate GPU resource reporting in cluster nodes interface

Stefy Lanza (nextime / spora ) authored
- Modified modal to show individual GPU-requiring workers on each node
- Allow granular driver selection (CUDA/ROCm/CPU) for each worker subprocess
- Updated database schema to store driver preferences per worker (hostname + token + worker_name); see the sketch below
- Enhanced API to handle per-worker driver setting with form field parsing
- Added restart_client_worker method to cluster master for individual worker restarts
- Frontend now displays worker-specific driver selection controls in modal
- Maintains node-level table view while providing worker-level configuration
- Supports CPU-only nodes and mixed GPU/CPU worker configurations
- Backward compatible with existing single-driver preference system

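A hedged sketch of per-worker driver preference storage keyed by hostname, token and worker name; SQLite and the column names are used here purely for illustration and may not match the project's database layer.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS client_driver_preferences (
    hostname    TEXT NOT NULL,
    token       TEXT NOT NULL,
    worker_name TEXT NOT NULL,
    driver      TEXT NOT NULL CHECK (driver IN ('cuda', 'rocm', 'cpu')),
    PRIMARY KEY (hostname, token, worker_name)
);
"""

def set_worker_driver(db_path, hostname, token, worker_name, driver):
    """Upsert the preferred backend for one worker on one node."""
    with sqlite3.connect(db_path) as conn:
        conn.executescript(SCHEMA)
        conn.execute(
            "INSERT INTO client_driver_preferences VALUES (?, ?, ?, ?) "
            "ON CONFLICT(hostname, token, worker_name) "
            "DO UPDATE SET driver = excluded.driver",
            (hostname, token, worker_name, driver),
        )

if __name__ == "__main__":
    set_worker_driver("cluster.db", "node-1", "tok123", "analysis", "cuda")
```
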
Stefy Lanza (nextime / spora ) authored
- Fixed undefined variable 'local_gpu_backends' in api_cluster_nodes function
- Properly defined local_available_backends, local_gpu_backends, and local_cpu_backends
- Updated local node detection to show nodes with any available backends (GPU or CPU)
- Ensured CPU-only nodes are correctly identified and displayed
- Maintained backward compatibility with existing GPU-only node detection

Stefy Lanza (nextime / spora ) authored
- Removed GPU-only requirement for cluster client connections
- CPU-only clients can now join cluster and run CPU-based workers
- Master accepts all clients regardless of GPU availability
- Nodes are properly marked as CPU-only when no GPUs detected
- Driver selection modal supports CUDA, ROCm, and CPU backends
- Local and remote workers can use any available backend (GPU or CPU)
- Enhanced cluster flexibility for mixed hardware environments
- CPU nodes contribute to cluster for CPU-only processing tasks
- Maintains backward compatibility with existing GPU-only workflows
- Clear node type identification in cluster management interface

Stefy Lanza (nextime / spora ) authored
- Cluster clients now refuse to connect without GPU capabilities (CUDA/ROCm)
- Cluster master rejects authentication from clients without GPU backends
- Local master node only appears in cluster nodes list if GPU backends are available
- Master already prevented launching local worker processes without GPUs
- Systems without GPUs cannot participate in distributed processing
- Clear error messages when GPU requirements are not met
- Maintains cluster integrity by ensuring all nodes contribute computational power

Stefy Lanza (nextime / spora ) authored
- Removed CPU option from driver selection (only CUDA/ROCm GPU drivers)
- Set CUDA as default driver selection when available
- Added available_gpu_backends field to node API responses
- Frontend dynamically populates driver options based on node's available GPUs
- API validation rejects non-GPU driver requests
- Cluster clients only accept CUDA/ROCm backend restart commands
- Improved user experience by showing only relevant driver options per node

Stefy Lanza (nextime / spora ) authored
- Added restart_workers command from master to clients for backend switching (sketch below)
- Cluster clients can now restart their workers with different backends (CUDA/ROCm/CPU)
- Added mixed GPU detection: nodes with both CUDA and ROCm show 'Mixed GPU Available' indicator
- Clients with mixed GPUs can switch between CUDA and ROCm backends dynamically
- Updated API endpoint to send restart commands to connected clients
- Clients save driver preferences and restart workers immediately when changed
- Graceful fallback to available backends if requested backend not available
- Visual indicator for nodes capable of backend switching

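A hedged sketch of what the master-to-client restart command might look like on the wire; the JSON framing and field names are assumptions, not the project's actual protocol.

```python
import json

def build_restart_command(backend, workers=None):
    """Build a restart_workers message for a connected client."""
    return {
        "type": "restart_workers",
        "backend": backend,             # "cuda", "rocm" or "cpu"
        "workers": workers or ["all"],  # or a list of specific worker names
    }

def handle_message(raw, restart_fn):
    """Client side: restart local workers when the command arrives."""
    msg = json.loads(raw)
    if msg.get("type") == "restart_workers":
        restart_fn(msg["backend"], msg["workers"])

if __name__ == "__main__":
    print(json.dumps(build_restart_command("rocm"), indent=2))
```
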
Stefy Lanza (nextime / spora ) authored
- Display actual cluster master weight instead of 'N/A' for local node
- Implement driver switching for local workers via modal popup
- Add switch_local_worker_backends() function to restart workers with new backends
- Update API endpoint to handle local worker driver changes
- Add CPU option to driver selection modal
- Local workers can now switch between CUDA, ROCm, and CPU backends dynamically
- Workers are terminated and restarted with new backend configuration

Stefy Lanza (nextime / spora ) authored
- Added cluster_master_weight config option (default: 'auto')
- Implemented weight precedence: command line > config file > default 'auto' (sketch below)
- 'auto' mode enables automatic weight adjustment (100->0 on first client, 0->100 when all disconnect)
- Explicit numeric weights disable automatic adjustment
- Updated sample config file with cluster_master_weight setting
- Enhanced command line parsing to accept 'auto' or numeric values
- Improved startup messages to indicate weight source and behavior

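A hedged sketch of the weight precedence and 'auto' behaviour described above; the function and state names are illustrative, not the project's actual code.

```python
def resolve_master_weight(cli_value, config_value):
    """Return (weight, auto_mode) following CLI > config file > default 'auto'."""
    raw = cli_value if cli_value is not None else (config_value or "auto")
    if str(raw).lower() == "auto":
        # Auto mode: start at 100, drop to 0 when the first client joins,
        # return to 100 when the last client disconnects.
        return 100, True
    return int(raw), False

def on_client_count_changed(count, state):
    """Adjust the weight only while auto mode is active."""
    if state["auto"]:
        state["weight"] = 0 if count > 0 else 100

if __name__ == "__main__":
    weight, auto = resolve_master_weight(None, None)
    state = {"weight": weight, "auto": auto}
    on_client_count_changed(1, state)  # first client connects -> weight 0
    on_client_count_changed(0, state)  # last client disconnects -> weight 100
    print(state)
```
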
Stefy Lanza (nextime / spora ) authored
- Added weight_explicit flag to track if --weight was specified on command line
- Automatic weight changes (100->0 on first client, 0->100 on last disconnect) only apply when weight is not explicitly set
- When --weight is specified, master maintains the explicit weight regardless of client connections
- Updated command line help and startup messages to clarify the behavior
- This allows administrators to override automatic weight management when needed

Stefy Lanza (nextime / spora ) authored
- Modified API to aggregate workers per node instead of showing each worker separately
- Each cluster node now appears as a single row with summarized worker information
- Workers column shows count and types: '2 workers - Analysis (CUDA), Training (ROCm)'
- Local workers are grouped into a single 'Local Master Node' entry
- Updated frontend to display worker summaries with detailed breakdown
- Updated API documentation to reflect new response format with workers_summary field

Stefy Lanza (nextime / spora ) authored
- Detect running local worker processes on cluster master using psutil (sketch below)
- Include local workers in cluster nodes API response with distinct styling
- Show local workers with blue background and 'Local' status indicator
- Display backend information (CUDA/ROCm) in worker names
- Indicate that local workers require manual restart for driver changes
- Update API documentation with local worker response format
- Local workers show N/A for weight since they don't participate in cluster load balancing

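A hedged sketch of local worker detection with psutil; the command-line markers used to recognise worker processes are assumptions about how the workers are launched.

```python
import psutil

WORKER_MARKERS = ("worker_analysis", "worker_training")  # hypothetical names

def find_local_workers():
    """Return (pid, name, backend) tuples for running worker processes."""
    workers = []
    for proc in psutil.process_iter(["pid", "name", "cmdline"]):
        try:
            cmdline = " ".join(proc.info["cmdline"] or [])
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
        for marker in WORKER_MARKERS:
            if marker in cmdline:
                # Guess the backend from the worker's command line.
                backend = ("rocm" if "rocm" in cmdline
                           else "cuda" if "cuda" in cmdline
                           else "cpu")
                workers.append((proc.info["pid"], marker, backend))
    return workers

if __name__ == "__main__":
    for pid, name, backend in find_local_workers():
        print(f"{name} (pid {pid}, {backend})")
```
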
Stefy Lanza (nextime / spora ) authored
- Add weight column to cluster nodes table showing load balancing weight
- Set default weights: master=0, clients=100
- Update API response to include client weight
- Update frontend to display weight information
- Update API documentation with weight field

Stefy Lanza (nextime / spora ) authored
- Add --shared-dir argument to cluster_master.py and cluster_client.py
- Implement shared directory file transfer for model files (sketch below)
- Falls back to websocket transfer if shared directory unavailable
- Update cluster client to handle model_shared_file messages
- Add documentation for shared directory feature in architecture.md
- Maintain backward compatibility with existing websocket transfers

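A hedged sketch of the shared-directory transfer with a websocket fallback; the model_shared_file type follows the commit text, while the fallback message type, helper name and framing are assumptions.

```python
import json
import os
import shutil

async def send_model(websocket, model_path, shared_dir=None):
    """Prefer copying into the shared directory; fall back to the socket."""
    name = os.path.basename(model_path)
    if shared_dir and os.path.isdir(shared_dir):
        # Shared directory available: copy the file and tell the client
        # where to pick it up.
        shutil.copy2(model_path, os.path.join(shared_dir, name))
        await websocket.send(json.dumps(
            {"type": "model_shared_file", "filename": name}))
    else:
        # Fallback: announce the file, then stream the raw bytes.
        await websocket.send(json.dumps(
            {"type": "model_file", "filename": name}))
        with open(model_path, "rb") as fh:
            await websocket.send(fh.read())
```
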
Stefy Lanza (nextime / spora ) authored
- Add uptime calculation for cluster nodes and master
- Include active/completed job counts per node and totals for master
- Display cluster master statistics before the nodes list
- Update API response format with master_stats and node-level metrics
- Add uptime formatting and job statistics to frontend
- Update API documentation with new response structure

Stefy Lanza (nextime / spora ) authored
- Add hostname passing from cluster client to master
- Create client_driver_preferences database table for storing driver preferences
- Add /admin/cluster_nodes page with auto-updating node list
- Add API endpoints for fetching nodes and setting driver preferences
- Update admin navbar and API documentation
- Apply database migrations

Stefy Lanza (nextime / spora ) authored
Implement secure websockets for cluster master and client with auto-generated self-signed certificates
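A hedged sketch of auto-generating a self-signed certificate and serving secure websockets with it; shelling out to openssl is an assumption about the generation method, while the cluster.crt/cluster.key names match the certificates ignored in an earlier commit.

```python
import asyncio
import os
import ssl
import subprocess
import websockets

CERT, KEY = "cluster.crt", "cluster.key"

def ensure_self_signed_cert():
    """Generate cluster.crt/cluster.key once, if they do not exist yet."""
    if not (os.path.exists(CERT) and os.path.exists(KEY)):
        subprocess.run(
            ["openssl", "req", "-x509", "-newkey", "rsa:2048", "-nodes",
             "-keyout", KEY, "-out", CERT, "-days", "365",
             "-subj", "/CN=vidai-cluster"],
            check=True,
        )

async def serve(handler, host="0.0.0.0", port=5003):
    """Serve wss:// using the auto-generated certificate."""
    ensure_self_signed_cert()
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(CERT, KEY)
    async with websockets.serve(handler, host, port, ssl=ctx):
        await asyncio.Future()  # run until cancelled
```
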
Stefy Lanza (nextime / spora ) authored
Show all defaults in /admin/config if not set in the database, hide configs set by config file/CLI/env, and add Redis config
