- 08 Oct, 2025 40 commits
-
-
Stefy Lanza (nextime / spora ) authored
- Use get_queue_by_job_id to check job status - More reliable than TCP polling for local jobs
-
Stefy Lanza (nextime / spora ) authored
- Initialize self.pending_jobs dict for job monitoring tasks - Fixes AttributeError when assigning local jobs
-
Stefy Lanza (nextime / spora ) authored
- Use 'analyze_request' instead of 'analysis_request' - Match the expected message type in worker processes
-
Stefy Lanza (nextime / spora ) authored
- Change response.get('msg_type') to response.msg_type - Change response.get('data') to response.data - Message objects don't have get method, use attributes instead
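A minimal sketch of why the change above is needed. The `Message` class here is a hypothetical stand-in for the project's own message type, assumed to be a plain object with `msg_type` and `data` attributes; the message type string is also illustrative.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class Message:
    # Hypothetical stand-in: the real class lives in the project and may differ.
    msg_type: str
    data: Dict[str, Any] = field(default_factory=dict)


def read_response(response: Message) -> Tuple[str, Dict[str, Any]]:
    # Attribute access works on a plain object.
    # response.get('msg_type') would raise AttributeError, because
    # Message is not a dict and defines no get() method.
    return response.msg_type, response.data


response = Message("analyze_response", {"result": "ok"})
msg_type, data = read_response(response)
```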
-
Stefy Lanza (nextime / spora ) authored
- If --weight is not specified, master weight changes to 0 when clients connect - If --weight is specified, master participates in job selection with that weight
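The weight rule described above could be sketched roughly as follows. The function name, the `default_weight` parameter, and the value 100 are assumptions for illustration; the commit does not state what the default weight is.

```python
from typing import Optional


def effective_master_weight(
    cli_weight: Optional[int],
    default_weight: int,
    connected_clients: int,
) -> int:
    # An explicit --weight always wins, even when clients are connected.
    if cli_weight is not None:
        return cli_weight
    # Without --weight, the master steps aside once any client connects.
    return 0 if connected_clients > 0 else default_weight


# Assumed default of 100, purely illustrative.
no_flag_with_clients = effective_master_weight(None, 100, connected_clients=2)
no_flag_alone = effective_master_weight(None, 100, connected_clients=0)
explicit_flag = effective_master_weight(50, 100, connected_clients=2)
```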
-
Stefy Lanza (nextime / spora ) authored
- Cluster master now participates in job selection even when clients are connected - Local workers compete with external workers based on weight and VRAM
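One way local and external workers might compete on weight and VRAM is sketched below. The `Worker` fields and the ordering rule (highest weight first, VRAM as tie-breaker) are assumptions; the commit only says that both factors are considered.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Worker:
    # Hypothetical record; the real cluster master tracks more state.
    name: str
    weight: int
    vram_gb: float
    local: bool = False


def select_worker(workers: List[Worker], required_vram_gb: float) -> Optional[Worker]:
    # Keep only workers that participate (weight > 0) and have enough VRAM.
    candidates = [
        w for w in workers
        if w.weight > 0 and w.vram_gb >= required_vram_gb
    ]
    if not candidates:
        return None
    # Assumed ordering: prefer higher weight, then more VRAM.
    return max(candidates, key=lambda w: (w.weight, w.vram_gb))


workers = [
    Worker("local", weight=100, vram_gb=24, local=True),
    Worker("remote", weight=50, vram_gb=48),
    Worker("small", weight=200, vram_gb=8),
]
```

With these sample values, a 16 GB job goes to the local worker, since the highest-weight worker lacks the VRAM.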
-
Stefy Lanza (nextime / spora ) authored
- Jobs are inserted as 'queued', not 'processing' - Cluster master now finds and assigns queued jobs
-
Stefy Lanza (nextime / spora ) authored
- Shows total VRAM detected on local GPUs when registering processes
-
Stefy Lanza (nextime / spora ) authored
- Local workers now require sufficient VRAM like other workers - Since the server has 24GB VRAM and jobs need 16GB, the check passes normally
-
Stefy Lanza (nextime / spora ) authored
- Local jobs now monitor for completion and handle results - Prevents jobs from hanging without result retrieval
-
Stefy Lanza (nextime / spora ) authored
- Changed local client ID to 'local' and marked as local to prevent cleanup - Local clients are not cleaned up after 60 seconds - Prevents 'Client local disconnected' messages
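The cleanup exemption above might look something like this sketch. The dictionary layout (`local`, `last_seen`) and the function name are hypothetical; only the 60-second timeout and the `'local'` client ID come from the commit message.

```python
from typing import Any, Dict, List


def stale_clients(
    clients: Dict[str, Dict[str, Any]],
    now: float,
    timeout: float = 60.0,
) -> List[str]:
    # Clients flagged as local are never considered stale, so the
    # master does not log "Client local disconnected" for itself.
    return [
        client_id
        for client_id, info in clients.items()
        if not info.get("local") and now - info["last_seen"] > timeout
    ]


clients = {
    "local": {"local": True, "last_seen": 0.0},
    "c1": {"last_seen": 0.0},
    "c2": {"last_seen": 50.0},
}
to_remove = stale_clients(clients, now=100.0)
```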
-
Stefy Lanza (nextime / spora ) authored
- Register local processes in cluster master when weight > 0 - Handle local job assignment by forwarding to backend via TCP - Allows jobs to run locally when no cluster clients are connected
-
Stefy Lanza (nextime / spora ) authored
- Modified cluster client to connect to backend's TCP web port instead of worker Unix socket - Backend acts as proper bridge: web interface (TCP) ↔ workers (Unix socket) - Cluster client now communicates with backend the same way as web interface - This fixes the timeout issue and ensures proper job flow through the backend
-
Stefy Lanza (nextime / spora ) authored
- Added clean_queue API endpoint in web.py for admin users - Added clean_queue database function to delete all queued/processing jobs - Added Clean Queue button to admin dashboard template - Button is only visible to admin users and allows clearing stuck jobs
-
Stefy Lanza (nextime / spora ) authored
- Removed backend process startup from cluster_client.py since vidai.py already starts it for client mode - This prevents 'Address already in use' error when running as cluster client - Cluster client now only manages worker processes, not the backend
-
Stefy Lanza (nextime / spora ) authored
- Modified cluster client to start a local backend process alongside workers - Backend process handles communication between cluster client and local workers - Fixed process cleanup to properly terminate backend and worker processes - This resolves the timeout issue when cluster client forwards jobs to local backend
-
Stefy Lanza (nextime / spora ) authored
- Modified queue.py to allow retried jobs to use distributed processing when available - Fixed async coroutine warning by adding await to _transfer_job_files call - Jobs that fail on clients will now be properly re-queued for distributed processing instead of falling back to local workers that may not exist
-
Stefy Lanza (nextime / spora ) authored
- Made assign_job_to_worker, _transfer_job_files, _transfer_file_via_websocket, enable_process, disable_process, update_process_weight, restart_client_workers, and restart_client_worker async methods - Added proper exception handling for websocket send operations - When websocket send fails due to broken connection, clients are now properly removed from available workers selection - This ensures that disconnected clients are immediately removed from the worker pool and jobs are re-assigned to available workers
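The send-failure handling described above can be illustrated with a small async sketch. `FakeSocket` stands in for a real websocket connection and `ConnectionError` for whatever exception the actual websocket library raises; both are assumptions, as is the simple "try the next client" policy.

```python
import asyncio
from typing import Dict, List, Optional


class FakeSocket:
    # Hypothetical stand-in for a websocket connection.
    def __init__(self, broken: bool = False) -> None:
        self.broken = broken
        self.sent: List[str] = []

    async def send(self, payload: str) -> None:
        if self.broken:
            raise ConnectionError("connection is closed")
        self.sent.append(payload)


async def assign_job(clients: Dict[str, FakeSocket], job: str) -> Optional[str]:
    # Iterate over a copy so broken clients can be removed safely.
    for client_id in list(clients):
        try:
            await clients[client_id].send(job)
            return client_id
        except ConnectionError:
            # Drop the dead client from the pool and try the next one.
            del clients[client_id]
    return None


clients = {"a": FakeSocket(broken=True), "b": FakeSocket()}
assigned_to = asyncio.run(assign_job(clients, "job-1"))
```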
-
Stefy Lanza (nextime / spora ) authored
- Fixed cluster_client.py to send proper Message objects instead of dicts to backend_comm.send_message() - Modified queue.py to prevent failed jobs from being immediately re-assigned to distributed processing - Jobs with retry_count > 0 now use local processing to avoid loops with failing distributed workers
-
Stefy Lanza (nextime / spora ) authored
- Added last_status_print timestamp to QueueManager class - Modified _process_queue to only print job status messages once every 10 seconds - This prevents console spam from the queue manager when jobs are waiting for workers
-
Stefy Lanza (nextime / spora ) authored
- Calculate should_print_status once per loop iteration instead of updating timestamp inside the loop - This ensures consistent rate limiting where all job status messages are either printed together or not at all
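A rough sketch of the rate limiting described above, with the print decision taken once per loop iteration so that all job messages appear together or not at all. The class and function names are illustrative; only `should_print_status`, `last_status_print`, and the 10-second interval are taken from the commit messages.

```python
import time
from typing import Callable, List


class StatusPrinter:
    def __init__(self, interval: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.interval = interval
        self.clock = clock
        # Start at -inf so the very first check always prints.
        self.last_status_print = float("-inf")

    def should_print(self) -> bool:
        now = self.clock()
        if now - self.last_status_print >= self.interval:
            self.last_status_print = now
            return True
        return False


def report_waiting_jobs(jobs: List[str], printer: StatusPrinter,
                        out: List[str]) -> None:
    # Decide once, before the loop, instead of per job.
    should_print_status = printer.should_print()
    for job in jobs:
        if should_print_status:
            out.append(f"Job {job} waiting for worker")
```

Updating the timestamp inside the per-job loop would let the first job print and silence the rest, which is the inconsistency this commit removes.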
-
Stefy Lanza (nextime / spora ) authored
- Added last_job_status_print timestamp to ClusterMaster class - Modified _management_loop to only print job status messages once every 10 seconds - This prevents console spam when jobs are waiting for workers
-
Stefy Lanza (nextime / spora ) authored
- Added consecutive_failures and failing flags to client tracking - Increment failure counter on job failures, reset on success - Mark clients as failing after 3 consecutive failures - Exclude failing clients from worker selection in all methods - Reset failure tracking when clients reconnect - This prevents problematic clients from receiving jobs until they reconnect
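The failure tracking above could be sketched as follows. `ClientState` and `eligible_clients` are hypothetical names; the field names and the threshold of 3 come from the commit message. The sketch assumes the `failing` flag is cleared only on reconnect, which matches the stated intent but is not confirmed by the source.

```python
from dataclasses import dataclass
from typing import Dict, List

MAX_CONSECUTIVE_FAILURES = 3


@dataclass
class ClientState:
    consecutive_failures: int = 0
    failing: bool = False

    def record_result(self, success: bool) -> None:
        if success:
            # Any success resets the streak.
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
            self.failing = True

    def on_reconnect(self) -> None:
        # A reconnect gives the client a clean slate.
        self.consecutive_failures = 0
        self.failing = False


def eligible_clients(clients: Dict[str, ClientState]) -> List[str]:
    # Failing clients are skipped during worker selection.
    return [cid for cid, state in clients.items() if not state.failing]
```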
-
Stefy Lanza (nextime / spora ) authored
- Added missing import os at the top of vidai/cluster_client.py - Removed redundant local import os in _handle_model_transfer_complete function - This fixes the error when handling master commands on client receiving a job
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
- Added delete button next to 'View Result' for completed jobs in history.html
- Button appears only for completed jobs and includes confirmation dialog
- Uses existing /job/{job_id}/delete route which already handles ownership checks
- Maintains consistent styling with other action buttons

Users can now clean up their completed job history by deleting individual jobs they no longer need.
-
Stefy Lanza (nextime / spora ) authored
- Enhanced cluster_master.select_worker_for_job() with more robust GPU detection:
  - Added flexible GPU info parsing with fallbacks
  - Support for incomplete GPU info structures
  - Allow CPU workers as fallback when no GPU workers available
  - Added detailed debug logging for troubleshooting worker selection
- Fixed queue._execute_local_job() to properly poll for backend results:
  - Changed from simulate processing to actual result polling
  - Added timeout handling (10 minutes max)
  - Proper error handling for failed jobs
- Simplified backend.handle_web_message() to use local worker routing:
  - Removed async cluster master calls that were failing
  - Use direct worker socket communication for local processing
- These changes should resolve the 'No suitable distributed worker' issue and make local processing work properly

The system now properly detects GPU workers, falls back to CPU workers if needed, and correctly processes jobs locally when distributed workers aren't available.
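The result polling with a 10-minute cap could be sketched like this. `poll_for_result`, the `fetch_result` callback, and the shape of the result dictionary (`status`, `error`) are assumptions for illustration; the commit only states that polling, a timeout, and error handling for failed jobs were added.

```python
import time
from typing import Any, Callable, Dict, Optional


def poll_for_result(
    fetch_result: Callable[[], Optional[Dict[str, Any]]],
    timeout: float = 600.0,   # 10 minutes max, as in the commit message
    interval: float = 1.0,    # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    deadline = clock() + timeout
    while clock() < deadline:
        result = fetch_result()
        if result is not None:
            # Surface backend failures instead of returning them silently.
            if result.get("status") == "failed":
                raise RuntimeError(result.get("error", "job failed"))
            return result
        sleep(interval)
    raise TimeoutError(f"no result after {timeout} seconds")
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting.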
-
Stefy Lanza (nextime / spora ) authored
- Fixed JavaScript error in analyze.html: 'data.result.substring is not a function' by checking if data.result is a string before calling substring, converting objects to JSON string if needed
- Added debug logging to cluster_master.select_worker_for_job() to diagnose why no distributed workers are found when GPU clients are connected
- Debug logs show available processes and process queue to help identify registration issues

This should resolve the JavaScript console error and help debug why cluster workers aren't being selected for jobs.
-