- 08 Oct, 2025 (40 commits)
-
Stefy Lanza (nextime / spora ) authored
- Local workers now require sufficient VRAM like other workers - Since the server has 24GB VRAM and jobs need 16GB, the check passes normally
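A minimal sketch of what a VRAM sufficiency check like this could look like, assuming a CUDA-capable setup; the helper names and the `JOB_VRAM_REQUIREMENT_GB` constant are illustrative, not taken from the repository:

```python
# Hypothetical sketch of a VRAM sufficiency check for local workers.
# Assumes torch is available; reports 0 GB when no CUDA device exists.
import torch

JOB_VRAM_REQUIREMENT_GB = 16  # assumed per-job requirement from the commit message

def get_available_vram_gb(device_index: int = 0) -> float:
    """Return total VRAM of the given CUDA device in GB, or 0 if unavailable."""
    if not torch.cuda.is_available():
        return 0.0
    props = torch.cuda.get_device_properties(device_index)
    return props.total_memory / (1024 ** 3)

def local_worker_has_enough_vram(required_gb: float = JOB_VRAM_REQUIREMENT_GB) -> bool:
    """Apply the same sufficiency check to local workers as to remote ones."""
    return get_available_vram_gb() >= required_gb

if __name__ == "__main__":
    # On a 24 GB GPU this prints True for the 16 GB requirement described above.
    print(local_worker_has_enough_vram())
```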
-
Stefy Lanza (nextime / spora ) authored
- Local jobs now monitor for completion and handle results - Prevents jobs from hanging without result retrieval
-
Stefy Lanza (nextime / spora ) authored
- Changed local client ID to 'local' and marked as local to prevent cleanup - Local clients are not cleaned up after 60 seconds - Prevents 'Client local disconnected' messages
-
Stefy Lanza (nextime / spora ) authored
- Register local processes in cluster master when weight > 0 - Handle local job assignment by forwarding to backend via TCP - Allows jobs to run locally when no cluster clients are connected
-
Stefy Lanza (nextime / spora ) authored
- Modified cluster client to connect to backend's TCP web port instead of worker Unix socket - Backend acts as proper bridge: web interface (TCP) ↔ workers (Unix socket) - Cluster client now communicates with backend the same way as web interface - This fixes the timeout issue and ensures proper job flow through the backend
Stefy Lanza (nextime / spora ) authored
- Added clean_queue API endpoint in web.py for admin users - Added clean_queue database function to delete all queued/processing jobs - Added Clean Queue button to admin dashboard template - Button is only visible to admin users and allows clearing stuck jobs
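A hedged sketch of the two clean_queue pieces described here, written Flask-style with sqlite3; the table name `queue`, its `status` column, the database path, and the `is_admin` session flag are assumptions for illustration:

```python
# Illustrative sketch only: assumes a Flask app, a sqlite3 database at DB_PATH,
# and a `queue` table with a `status` column; none of these names are confirmed.
import sqlite3
from flask import Flask, jsonify, session

app = Flask(__name__)
DB_PATH = "vidai.db"  # assumed path

def clean_queue() -> int:
    """Delete all jobs still marked queued or processing; return the row count."""
    with sqlite3.connect(DB_PATH) as conn:
        cur = conn.execute(
            "DELETE FROM queue WHERE status IN ('queued', 'processing')"
        )
        return cur.rowcount

@app.route("/api/clean_queue", methods=["POST"])
def clean_queue_endpoint():
    # Only admins may clear the queue, mirroring the button's visibility rule.
    if not session.get("is_admin"):
        return jsonify({"error": "admin only"}), 403
    removed = clean_queue()
    return jsonify({"removed": removed})
```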
-
Stefy Lanza (nextime / spora ) authored
- Removed backend process startup from cluster_client.py since vidai.py already starts it for client mode - This prevents 'Address already in use' error when running as cluster client - Cluster client now only manages worker processes, not the backend
-
Stefy Lanza (nextime / spora ) authored
- Modified cluster client to start a local backend process alongside workers - Backend process handles communication between cluster client and local workers - Fixed process cleanup to properly terminate backend and worker processes - This resolves the timeout issue when cluster client forwards jobs to local backend
-
Stefy Lanza (nextime / spora ) authored
- Modified queue.py to allow retried jobs to use distributed processing when available - Fixed async coroutine warning by adding await to _transfer_job_files call - Jobs that fail on clients will now be properly re-queued for distributed processing instead of falling back to local workers that may not exist
-
Stefy Lanza (nextime / spora ) authored
- Made assign_job_to_worker, _transfer_job_files, _transfer_file_via_websocket, enable_process, disable_process, update_process_weight, restart_client_workers, and restart_client_worker async methods
- Added proper exception handling for websocket send operations
- When a websocket send fails due to a broken connection, the client is now removed from the available workers selection
- This ensures that disconnected clients are immediately removed from the worker pool and jobs are re-assigned to available workers
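A minimal sketch of the failure-aware async send this describes, assuming the `websockets` package and a simple in-memory registry; the `clients` and `available_workers` names are illustrative:

```python
# Sketch of failure-aware async sending; assumes the `websockets` package and
# a `clients` dict mapping client_id -> websocket connection. Names are assumed.
import json
from websockets.exceptions import ConnectionClosed

clients: dict[str, object] = {}
available_workers: set[str] = set()

async def send_to_client(client_id: str, payload: dict) -> bool:
    """Send a message to a client; drop it from the worker pool on failure."""
    ws = clients.get(client_id)
    if ws is None:
        return False
    try:
        await ws.send(json.dumps(payload))
        return True
    except ConnectionClosed:
        # Broken connection: stop offering this client for new jobs.
        clients.pop(client_id, None)
        available_workers.discard(client_id)
        return False

async def assign_job_to_worker(client_id: str, job: dict) -> bool:
    """Async assignment; the caller re-queues the job if the send fails."""
    return await send_to_client(client_id, {"type": "job_assign", "job": job})
```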
-
Stefy Lanza (nextime / spora ) authored
- Fixed cluster_client.py to send proper Message objects instead of dicts to backend_comm.send_message() - Modified queue.py to prevent failed jobs from being immediately re-assigned to distributed processing - Jobs with retry_count > 0 now use local processing to avoid loops with failing distributed workers
-
Stefy Lanza (nextime / spora ) authored
- Added last_status_print timestamp to QueueManager class - Modified _process_queue to only print job status messages once every 10 seconds - This prevents console spam from the queue manager when jobs are waiting for workers
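A small sketch of the rate-limiting idea; the 10-second interval comes from the commit message, while the class shape and attribute names are assumptions:

```python
# Illustrative rate limiter for queue status prints; class shape is assumed.
import time

class QueueManager:
    STATUS_PRINT_INTERVAL = 10.0  # seconds, per the commit message

    def __init__(self) -> None:
        self.last_status_print = 0.0

    def _process_queue(self, pending_jobs: list[dict]) -> None:
        # Decide once per loop iteration whether status lines may be printed,
        # so all messages in one pass are printed together or not at all.
        now = time.monotonic()
        should_print_status = now - self.last_status_print >= self.STATUS_PRINT_INTERVAL
        if should_print_status:
            self.last_status_print = now
        for job in pending_jobs:
            if should_print_status:
                print(f"Job {job.get('id')} waiting for a worker")

if __name__ == "__main__":
    qm = QueueManager()
    qm._process_queue([{"id": 1}, {"id": 2}])
```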
-
Stefy Lanza (nextime / spora ) authored
- Calculate should_print_status once per loop iteration instead of updating timestamp inside the loop - This ensures consistent rate limiting where all job status messages are either printed together or not at all
-
Stefy Lanza (nextime / spora ) authored
- Added last_job_status_print timestamp to ClusterMaster class - Modified _management_loop to only print job status messages once every 10 seconds - This prevents console spam when jobs are waiting for workers
-
Stefy Lanza (nextime / spora ) authored
- Added consecutive_failures and failing flags to client tracking - Increment failure counter on job failures, reset on success - Mark clients as failing after 3 consecutive failures - Exclude failing clients from worker selection in all methods - Reset failure tracking when clients reconnect - This prevents problematic clients from receiving jobs until they reconnect
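A rough sketch of the failure tracking described above; the threshold of 3 comes from the commit message, while the `ClientState` dataclass and registry names are assumptions used for illustration:

```python
# Sketch of per-client failure tracking; dataclass and field names are assumed.
from dataclasses import dataclass

FAILURE_THRESHOLD = 3  # from the commit message

@dataclass
class ClientState:
    client_id: str
    consecutive_failures: int = 0
    failing: bool = False

clients: dict[str, ClientState] = {}

def record_job_result(client_id: str, success: bool) -> None:
    """Update counters; mark the client failing after 3 consecutive failures."""
    state = clients.setdefault(client_id, ClientState(client_id))
    if success:
        state.consecutive_failures = 0
        state.failing = False
    else:
        state.consecutive_failures += 1
        if state.consecutive_failures >= FAILURE_THRESHOLD:
            state.failing = True

def on_client_reconnect(client_id: str) -> None:
    """Reconnecting resets the tracking so the client may receive jobs again."""
    clients[client_id] = ClientState(client_id)

def selectable_clients() -> list[str]:
    """Worker selection skips clients currently marked as failing."""
    return [cid for cid, s in clients.items() if not s.failing]
```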
-
Stefy Lanza (nextime / spora ) authored
- Added missing import os at the top of vidai/cluster_client.py - Removed redundant local import os in _handle_model_transfer_complete function - This fixes the error when handling master commands on client receiving a job
-
Stefy Lanza (nextime / spora ) authored
- Added delete button next to 'View Result' for completed jobs in history.html - Button appears only for completed jobs and includes confirmation dialog - Uses existing /job/{job_id}/delete route which already handles ownership checks - Maintains consistent styling with other action buttons - Users can now clean up their completed job history by deleting individual jobs they no longer need.
-
Stefy Lanza (nextime / spora ) authored
- Enhanced cluster_master.select_worker_for_job() with more robust GPU detection:
  - Added flexible GPU info parsing with fallbacks
  - Support for incomplete GPU info structures
  - Allow CPU workers as fallback when no GPU workers are available
  - Added detailed debug logging for troubleshooting worker selection
- Fixed queue._execute_local_job() to properly poll for backend results:
  - Changed from simulated processing to actual result polling
  - Added timeout handling (10 minutes max)
  - Proper error handling for failed jobs
- Simplified backend.handle_web_message() to use local worker routing:
  - Removed async cluster master calls that were failing
  - Use direct worker socket communication for local processing
- These changes should resolve the 'No suitable distributed worker' issue and make local processing work properly. The system now properly detects GPU workers, falls back to CPU workers if needed, and correctly processes jobs locally when distributed workers aren't available.
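A hedged sketch of the result-polling idea in _execute_local_job; the 10-minute timeout is taken from the commit message, while `backend_get_result` and the poll interval are illustrative stand-ins for whatever the backend actually exposes:

```python
# Sketch only: polls a hypothetical backend for a job result with a hard timeout.
# `backend_get_result` stands in for the real backend call and is not a real API.
import time
from typing import Optional

RESULT_TIMEOUT_SECONDS = 10 * 60  # 10 minutes, per the commit message
POLL_INTERVAL_SECONDS = 2.0       # assumed

def backend_get_result(job_id: str) -> Optional[dict]:
    """Placeholder for querying the backend; returns None while still running."""
    return None

def execute_local_job(job_id: str) -> dict:
    """Poll the backend until a result arrives or the timeout expires."""
    deadline = time.monotonic() + RESULT_TIMEOUT_SECONDS
    while time.monotonic() < deadline:
        result = backend_get_result(job_id)
        if result is not None:
            if result.get("status") == "failed":
                raise RuntimeError(result.get("error", "job failed"))
            return result
        time.sleep(POLL_INTERVAL_SECONDS)
    raise TimeoutError(f"Job {job_id} did not finish within 10 minutes")
```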
-
Stefy Lanza (nextime / spora ) authored
- Fixed JavaScript error in analyze.html: 'data.result.substring is not a function' by checking if data.result is a string before calling substring, converting objects to JSON string if needed - Added debug logging to cluster_master.select_worker_for_job() to diagnose why no distributed workers are found when GPU clients are connected - Debug logs show available processes and process queue to help identify registration issues - This should resolve the JavaScript console error and help debug why cluster workers aren't being selected for jobs.
-
Stefy Lanza (nextime / spora ) authored
- Added cancel_job method to QueueManager for cancelling running jobs
- Added /job/<id>/cancel route in web.py for cancelling jobs via POST
- Updated history.html template to show:
  - Cancel button for processing jobs (orange button)
  - Delete button for cancelled jobs (red button)
  - Cancelled status styling (gray background)
- Added JavaScript updateJobActions function for dynamic action updates
- Modified worker_analysis.py to check for job cancellation during processing:
  - Added check_job_cancelled function to query database
  - Modified analyze_media to check cancellation before each frame and summary
  - Workers now stop processing and return 'Job cancelled by user' message
- Updated queue.py to pass job_id in data sent to workers for cancellation checking
- Job cancellation works for both local and distributed workers
- Users can now cancel running analysis jobs from the history page, and cancelled jobs can be deleted from history.
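A minimal sketch of the cooperative cancellation check described above, assuming a sqlite3 `queue` table with `id` and `status` columns; the schema and database path are assumptions, not taken from the repository:

```python
# Sketch of a cooperative cancellation check; the schema is assumed. The worker
# calls this before each expensive step and bails out early when cancelled.
import sqlite3

DB_PATH = "vidai.db"  # assumed

def check_job_cancelled(job_id: int) -> bool:
    """Return True if the job's status in the database is 'cancelled'."""
    with sqlite3.connect(DB_PATH) as conn:
        row = conn.execute(
            "SELECT status FROM queue WHERE id = ?", (job_id,)
        ).fetchone()
    return bool(row) and row[0] == "cancelled"

def analyze_media(job_id: int, frames: list) -> str:
    """Check for cancellation before each frame and before the summary step."""
    for frame in frames:
        if check_job_cancelled(job_id):
            return "Job cancelled by user"
        # ... per-frame analysis would run here ...
    if check_job_cancelled(job_id):
        return "Job cancelled by user"
    return "analysis complete"
```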
-
Stefy Lanza (nextime / spora ) authored
- Made transformers import conditional in models.py to avoid import errors when not installed - Fixed update_queue_status call to use 'error' parameter instead of 'error_message' - Added checks for transformers availability before using it in model loading - This resolves the ModuleNotFoundError and TypeError when running jobs - The system can now handle job scheduling even when the transformers library is not available, and properly reports errors when job execution fails.
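A common pattern for the conditional import described here, as a sketch; which transformers symbols the real models.py imports is not confirmed, so AutoModel and AutoTokenizer are used as examples:

```python
# Sketch of a guarded optional dependency; AutoModel/AutoTokenizer are example
# symbols, not necessarily the ones models.py actually uses.
try:
    from transformers import AutoModel, AutoTokenizer
    TRANSFORMERS_AVAILABLE = True
except ImportError:
    AutoModel = AutoTokenizer = None
    TRANSFORMERS_AVAILABLE = False

def load_model(model_name: str):
    """Fail with a clear error instead of a ModuleNotFoundError at import time."""
    if not TRANSFORMERS_AVAILABLE:
        raise RuntimeError(
            "transformers is not installed; install it to load models"
        )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    return model, tokenizer
```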
-
Stefy Lanza (nextime / spora ) authored
- Added get_queue_status to the database imports in web.py - This fixes the NameError when accessing /api/job_status/<job_id> - The job status API endpoint now works correctly for real-time job monitoring
-
Stefy Lanza (nextime / spora ) authored
- Changed analyze route to stay on page after job submission instead of redirecting to history
- Added submitted_job parameter to template to track current job
- Modified sidebar to show for all users (not just admins)
- Added job progress section in sidebar that displays:
  - Job ID and status (queued/processing/completed/failed)
  - Tokens used
  - Result preview
- Added /api/job_status/<job_id> endpoint for real-time job status
- Added JavaScript polling for job status updates every 2 seconds
- Job progress updates automatically without page refresh
- Users can see their analysis job progress in real-time in the sidebar
- The analyze page now provides immediate feedback and progress tracking instead of requiring navigation to the history page.
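A hedged sketch of a status endpoint like the one described, written Flask-style; the `get_queue_status` helper and the fields it returns are assumptions for illustration:

```python
# Sketch only: a Flask endpoint mirroring /api/job_status/<job_id>. The helper
# get_queue_status and its fields (status, tokens, result) are assumed names.
from flask import Flask, jsonify

app = Flask(__name__)

def get_queue_status(job_id: int) -> dict | None:
    """Placeholder for the database lookup used by the real endpoint."""
    return {"status": "processing", "tokens": 0, "result": None}

@app.route("/api/job_status/<int:job_id>")
def job_status(job_id: int):
    job = get_queue_status(job_id)
    if job is None:
        return jsonify({"error": "not found"}), 404
    # The analyze page polls this endpoint every 2 seconds and updates the sidebar.
    return jsonify({
        "job_id": job_id,
        "status": job["status"],
        "tokens": job.get("tokens", 0),
        "result": job.get("result"),
    })
```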
-
Stefy Lanza (nextime / spora ) authored
- Fixed process type mapping in queue manager ('analyze' -> 'analysis', 'train' -> 'training')
- Implemented actual job sending in cluster master assign_job_to_worker()
- Modified cluster client to forward jobs to local backend and monitor results
- Added result polling mechanism for cluster jobs
- Jobs should now execute on connected cluster workers instead of remaining queued
- The issue was that jobs were being assigned but never sent to workers. Now:
  1. Queue manager selects worker using VRAM-aware logic
  2. Cluster master assigns job and sends it via websocket
  3. Cluster client receives job and forwards to local backend
  4. Cluster client polls backend for results and sends back to master
  5. Results are properly returned to web interface
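The type mapping fix itself is small; a sketch of what it plausibly looks like (the dict and function names are illustrative):

```python
# Illustrative mapping between job types used by the web layer and the worker
# process types expected by the cluster; the dict name is an assumption.
PROCESS_TYPE_MAP = {
    "analyze": "analysis",
    "train": "training",
}

def to_process_type(job_type: str) -> str:
    """Translate a queued job's type into the worker process type."""
    return PROCESS_TYPE_MAP.get(job_type, job_type)
```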
-
Stefy Lanza (nextime / spora ) authored
- Updated queue manager to use select_worker_for_job() and assign_job_to_worker() instead of unimplemented assign_job_with_model() - Now properly implements VRAM-aware worker selection based on model requirements - Jobs will be assigned to distributed workers when available with sufficient VRAM - Falls back to local processing when no suitable distributed worker is found - Added proper error handling and logging for job assignment process
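A rough sketch of VRAM-aware selection with a local fallback; the worker record shape and the `vram_gb` field are assumptions:

```python
# Sketch of VRAM-aware worker selection; the worker record shape is assumed.
from typing import Optional

def select_worker_for_job(workers: list[dict], required_vram_gb: float) -> Optional[dict]:
    """Pick the worker with the most VRAM that still meets the requirement."""
    eligible = [w for w in workers if w.get("vram_gb", 0) >= required_vram_gb]
    if not eligible:
        return None  # caller falls back to local processing
    return max(eligible, key=lambda w: w["vram_gb"])

if __name__ == "__main__":
    pool = [{"id": "a", "vram_gb": 8}, {"id": "b", "vram_gb": 24}]
    print(select_worker_for_job(pool, 16))  # -> worker "b"
    print(select_worker_for_job(pool, 48))  # -> None, local fallback
```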
-
Stefy Lanza (nextime / spora ) authored
- Fixed 'list.append() takes exactly one argument (0 given)' error in update_queue_status - Removed empty params.append() calls for timestamp fields that use CURRENT_TIMESTAMP directly - Queue processing should now work correctly without errors
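A sketch of the pattern behind this fix: building a dynamic UPDATE where fields set to CURRENT_TIMESTAMP contribute no bind parameter, so nothing gets appended to params for them. The table and column names are assumptions:

```python
# Sketch of building an UPDATE statement dynamically; table/column names are
# assumed. CURRENT_TIMESTAMP fields take no bind parameter, so params.append()
# is never called with zero arguments for them (the original bug).
import sqlite3

def update_queue_status(conn: sqlite3.Connection, job_id: int,
                        status: str, error: str | None = None) -> None:
    sets = ["status = ?"]
    params: list = [status]
    if error is not None:
        sets.append("error = ?")
        params.append(error)
    if status in ("completed", "failed"):
        sets.append("finished_at = CURRENT_TIMESTAMP")  # no parameter needed
    params.append(job_id)
    conn.execute(f"UPDATE queue SET {', '.join(sets)} WHERE id = ?", params)
    conn.commit()
```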
-
Stefy Lanza (nextime / spora ) authored
- Add delete_queue_item function to database.py with ownership validation - Add delete_job method to QueueManager class - Add /job/<id>/delete endpoint in web.py with user authentication - Update history.html template to show delete button for queued jobs - Only allow users to delete their own jobs or admins to delete any job - Add confirmation dialog for job deletion
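A minimal sketch of ownership-validated deletion; the `queue` table, its `user_id` column, the database path, and the admin flag are assumptions:

```python
# Sketch of delete_queue_item with an ownership check; schema names are assumed.
import sqlite3

DB_PATH = "vidai.db"  # assumed

def delete_queue_item(job_id: int, user_id: int, is_admin: bool) -> bool:
    """Delete a job only if the requester owns it or is an admin."""
    with sqlite3.connect(DB_PATH) as conn:
        if is_admin:
            cur = conn.execute("DELETE FROM queue WHERE id = ?", (job_id,))
        else:
            cur = conn.execute(
                "DELETE FROM queue WHERE id = ? AND user_id = ?",
                (job_id, user_id),
            )
        return cur.rowcount > 0
```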
-