- 08 Oct, 2025 40 commits
-
-
Stefy Lanza (nextime / spora ) authored
- Added clean_queue API endpoint in web.py for admin users - Added clean_queue database function to delete all queued/processing jobs - Added Clean Queue button to admin dashboard template - Button is only visible to admin users and allows clearing stuck jobs
-
Stefy Lanza (nextime / spora ) authored
- Removed backend process startup from cluster_client.py since vidai.py already starts it for client mode - This prevents 'Address already in use' error when running as cluster client - Cluster client now only manages worker processes, not the backend
-
Stefy Lanza (nextime / spora ) authored
- Modified cluster client to start a local backend process alongside workers - Backend process handles communication between cluster client and local workers - Fixed process cleanup to properly terminate backend and worker processes - This resolves the timeout issue when cluster client forwards jobs to local backend
-
Stefy Lanza (nextime / spora ) authored
- Modified queue.py to allow retried jobs to use distributed processing when available - Fixed async coroutine warning by adding await to _transfer_job_files call - Jobs that fail on clients will now be properly re-queued for distributed processing instead of falling back to local workers that may not exist
-
Stefy Lanza (nextime / spora ) authored
- Made assign_job_to_worker, _transfer_job_files, _transfer_file_via_websocket, enable_process, disable_process, update_process_weight, restart_client_workers, and restart_client_worker async methods - Added proper exception handling for websocket send operations - When websocket send fails due to broken connection, clients are now properly removed from available workers selection - This ensures that disconnected clients are immediately removed from the worker pool and jobs are re-assigned to available workers
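The send-failure handling described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `MasterSketch`, `FakeSocket`, and `dispatch` are hypothetical names, and the real websocket library may raise a different exception type than the `OSError` family assumed here.

```python
import asyncio


class FakeSocket:
    """Hypothetical stand-in for a websocket connection."""

    def __init__(self, broken=False):
        self.broken = broken
        self.sent = []

    async def send(self, payload):
        if self.broken:
            raise ConnectionError("connection closed")
        self.sent.append(payload)


class MasterSketch:
    """Hypothetical, simplified view of the master's client pool."""

    def __init__(self):
        self.clients = {}  # client_id -> socket

    async def assign_job_to_worker(self, client_id, job):
        """Send a job; drop the client from the pool if the send fails."""
        socket = self.clients.get(client_id)
        if socket is None:
            return False
        try:
            await socket.send(job)
            return True
        except OSError:
            # ConnectionError subclasses OSError. A broken connection means
            # the client is removed so it is not selected again.
            self.clients.pop(client_id, None)
            return False

    async def dispatch(self, job):
        """Try clients in insertion order until one accepts the job."""
        for client_id in list(self.clients):
            if await self.assign_job_to_worker(client_id, job):
                return client_id
        return None
```

Iterating over `list(self.clients)` keeps the loop safe while failed clients are removed from the dictionary.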
-
Stefy Lanza (nextime / spora ) authored
- Fixed cluster_client.py to send proper Message objects instead of dicts to backend_comm.send_message() - Modified queue.py to prevent failed jobs from being immediately re-assigned to distributed processing - Jobs with retry_count > 0 now use local processing to avoid loops with failing distributed workers
-
Stefy Lanza (nextime / spora ) authored
- Added last_status_print timestamp to QueueManager class - Modified _process_queue to only print job status messages once every 10 seconds - This prevents console spam from the queue manager when jobs are waiting for workers
-
Stefy Lanza (nextime / spora ) authored
- Calculate should_print_status once per loop iteration instead of updating timestamp inside the loop - This ensures consistent rate limiting where all job status messages are either printed together or not at all
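A minimal sketch of this rate limiting, assuming a 10-second interval as in the related commits. `QueueManagerSketch`, the injectable `clock`, and the `printed` list are hypothetical and exist only to make the behaviour easy to observe.

```python
import time


class QueueManagerSketch:
    """Hypothetical, reduced version of a queue manager's status logging."""

    STATUS_INTERVAL = 10.0  # seconds between bursts of status messages

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_status_print = float("-inf")  # so the first pass prints
        self.printed = []

    def process_queue(self, waiting_jobs):
        # Decide once per iteration, before looping over jobs, so that all
        # status lines in this pass are printed together or not at all.
        now = self.clock()
        should_print_status = now - self.last_status_print >= self.STATUS_INTERVAL
        if should_print_status:
            self.last_status_print = now

        for job_id in waiting_jobs:
            if should_print_status:
                line = f"Job {job_id} waiting for available workers"
                self.printed.append(line)
                print(line)
```

Updating the timestamp inside the job loop would let the first job print and silence the rest, which is the inconsistency this commit describes.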
-
Stefy Lanza (nextime / spora ) authored
- Added last_job_status_print timestamp to ClusterMaster class - Modified _management_loop to only print job status messages once every 10 seconds - This prevents console spam when jobs are waiting for workers
-
Stefy Lanza (nextime / spora ) authored
- Added consecutive_failures and failing flags to client tracking - Increment failure counter on job failures, reset on success - Mark clients as failing after 3 consecutive failures - Exclude failing clients from worker selection in all methods - Reset failure tracking when clients reconnect - This prevents problematic clients from receiving jobs until they reconnect
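The failure tracking above might look something like the sketch below. `ClientTracker`, `ClientState`, and the method names are hypothetical. The sketch assumes the `failing` flag stays set until the client reconnects, as the commit message suggests.

```python
from dataclasses import dataclass

MAX_CONSECUTIVE_FAILURES = 3  # threshold taken from the commit message


@dataclass
class ClientState:
    consecutive_failures: int = 0
    failing: bool = False


class ClientTracker:
    """Hypothetical tracker for per-client job outcomes."""

    def __init__(self):
        self.clients = {}  # client_id -> ClientState

    def on_connect(self, client_id):
        # A fresh connection (or reconnection) resets failure tracking.
        self.clients[client_id] = ClientState()

    def record_result(self, client_id, success):
        state = self.clients[client_id]
        if success:
            state.consecutive_failures = 0
        else:
            state.consecutive_failures += 1
            if state.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
                state.failing = True

    def selectable(self):
        """Clients that may receive jobs, i.e. those not marked failing."""
        return [cid for cid, state in self.clients.items() if not state.failing]
```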
-
Stefy Lanza (nextime / spora ) authored
- Added missing import os at the top of vidai/cluster_client.py - Removed redundant local import os in _handle_model_transfer_complete function - This fixes the error when handling master commands on client receiving a job
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
-
Stefy Lanza (nextime / spora ) authored
- Added delete button next to 'View Result' for completed jobs in history.html
- Button appears only for completed jobs and includes confirmation dialog
- Uses existing /job/{job_id}/delete route which already handles ownership checks
- Maintains consistent styling with other action buttons

Users can now clean up their completed job history by deleting individual jobs they no longer need.
-
Stefy Lanza (nextime / spora ) authored
- Enhanced cluster_master.select_worker_for_job() with more robust GPU detection:
  - Added flexible GPU info parsing with fallbacks
  - Support for incomplete GPU info structures
  - Allow CPU workers as fallback when no GPU workers are available
  - Added detailed debug logging for troubleshooting worker selection
- Fixed queue._execute_local_job() to properly poll for backend results:
  - Changed from simulated processing to actual result polling
  - Added timeout handling (10 minutes max)
  - Proper error handling for failed jobs
- Simplified backend.handle_web_message() to use local worker routing:
  - Removed async cluster master calls that were failing
  - Use direct worker socket communication for local processing
- These changes should resolve the 'No suitable distributed worker' issue and make local processing work properly

The system now properly detects GPU workers, falls back to CPU workers if needed, and correctly processes jobs locally when distributed workers aren't available.
-
Stefy Lanza (nextime / spora ) authored
- Fixed JavaScript error in analyze.html: 'data.result.substring is not a function' by checking if data.result is a string before calling substring, converting objects to JSON string if needed
- Added debug logging to cluster_master.select_worker_for_job() to diagnose why no distributed workers are found when GPU clients are connected
- Debug logs show available processes and process queue to help identify registration issues

This should resolve the JavaScript console error and help debug why cluster workers aren't being selected for jobs.
-
Stefy Lanza (nextime / spora ) authored
- Added cancel_job method to QueueManager for cancelling running jobs
- Added /job/<id>/cancel route in web.py for cancelling jobs via POST
- Updated history.html template to show:
  - Cancel button for processing jobs (orange button)
  - Delete button for cancelled jobs (red button)
  - Cancelled status styling (gray background)
- Added JavaScript updateJobActions function for dynamic action updates
- Modified worker_analysis.py to check for job cancellation during processing:
  - Added check_job_cancelled function to query database
  - Modified analyze_media to check cancellation before each frame and summary
  - Workers now stop processing and return 'Job cancelled by user' message
- Updated queue.py to pass job_id in data sent to workers for cancellation checking
- Job cancellation works for both local and distributed workers

Users can now cancel running analysis jobs from the history page, and cancelled jobs can be deleted from history.
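The cooperative cancellation check in the worker can be sketched as below. This is a simplified illustration: the commit's check_job_cancelled queries the database, while here a plain callable `is_cancelled` stands in for it, and `describe` is a hypothetical placeholder for the per-frame model call.

```python
CANCELLED_MESSAGE = "Job cancelled by user"


def analyze_media(frames, is_cancelled, describe=lambda frame: f"description of {frame}"):
    """Process frames one by one, stopping early if the job was cancelled.

    `is_cancelled` is a zero-argument callable (an assumption of this sketch)
    that returns True once the user has cancelled the job.
    """
    descriptions = []
    for frame in frames:
        # Check before each frame so a cancel takes effect quickly.
        if is_cancelled():
            return CANCELLED_MESSAGE
        descriptions.append(describe(frame))

    # Check once more before the final summary step.
    if is_cancelled():
        return CANCELLED_MESSAGE
    return " | ".join(descriptions)
```

Checking between frames keeps the worker responsive to cancellation without interrupting a model call that is already in progress.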
-
Stefy Lanza (nextime / spora ) authored
- Made transformers import conditional in models.py to avoid import errors when not installed
- Fixed update_queue_status call to use 'error' parameter instead of 'error_message'
- Added checks for transformers availability before using it in model loading
- This resolves the ModuleNotFoundError and TypeError when running jobs

The system can now handle job scheduling even when transformers library is not available, and properly reports errors when job execution fails.
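One common way to make such an import conditional is sketched below. It is not necessarily what models.py does: `is_available`, `TRANSFORMERS_AVAILABLE`, and `load_model` are hypothetical names, and a `try: import transformers / except ImportError:` block at module level would be an equally valid form.

```python
import importlib
import importlib.util


def is_available(module_name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Evaluated once at import time; scheduling code can run either way.
TRANSFORMERS_AVAILABLE = is_available("transformers")


def load_model(model_path):
    """Load a model, failing with a clear error if transformers is missing."""
    if not TRANSFORMERS_AVAILABLE:
        raise RuntimeError(
            f"transformers is not installed; cannot load {model_path}"
        )
    # Imported lazily so that merely importing this module never fails.
    transformers = importlib.import_module("transformers")
    return transformers.AutoModel.from_pretrained(model_path)
```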
-
Stefy Lanza (nextime / spora ) authored
- Added get_queue_status to the database imports in web.py - This fixes the NameError when accessing /api/job_status/<job_id> - The job status API endpoint now works correctly for real-time job monitoring
-
Stefy Lanza (nextime / spora ) authored
- Changed analyze route to stay on page after job submission instead of redirecting to history
- Added submitted_job parameter to template to track current job
- Modified sidebar to show for all users (not just admins)
- Added job progress section in sidebar that displays:
  - Job ID and status (queued/processing/completed/failed)
  - Tokens used
  - Result preview
- Added /api/job_status/<job_id> endpoint for real-time job status
- Added JavaScript polling for job status updates every 2 seconds
- Job progress updates automatically without page refresh
- Users can see their analysis job progress in real-time in the sidebar

The analyze page now provides immediate feedback and progress tracking instead of requiring navigation to the history page.
-
Stefy Lanza (nextime / spora ) authored
- Fixed process type mapping in queue manager ('analyze' -> 'analysis', 'train' -> 'training')
- Implemented actual job sending in cluster master assign_job_to_worker()
- Modified cluster client to forward jobs to local backend and monitor results
- Added result polling mechanism for cluster jobs
- Jobs should now execute on connected cluster workers instead of remaining queued

The issue was that jobs were being assigned but never sent to workers. Now:

1. Queue manager selects worker using VRAM-aware logic
2. Cluster master assigns job and sends it via websocket
3. Cluster client receives job and forwards to local backend
4. Cluster client polls backend for results and sends back to master
5. Results are properly returned to web interface
-
Stefy Lanza (nextime / spora ) authored
- Updated queue manager to use select_worker_for_job() and assign_job_to_worker() instead of unimplemented assign_job_with_model() - Now properly implements VRAM-aware worker selection based on model requirements - Jobs will be assigned to distributed workers when available with sufficient VRAM - Falls back to local processing when no suitable distributed worker is found - Added proper error handling and logging for job assignment process
-
Stefy Lanza (nextime / spora ) authored
- Fixed 'list.append() takes exactly one argument (0 given)' error in update_queue_status - Removed empty params.append() calls for timestamp fields that use CURRENT_TIMESTAMP directly - Queue processing should now work correctly without errors
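The bug pattern and its fix can be illustrated with the sketch below. It assumes SQLite and a hypothetical column layout (`status`, `error`, `started_at`, `completed_at`). Only the table name processing_queue and the function name come from the commit history.

```python
import sqlite3


def update_queue_status(conn, queue_id, status, error=None):
    """Update a queue row, building the SET clause and params in step."""
    set_clauses = ["status = ?"]
    params = [status]

    if error is not None:
        set_clauses.append("error = ?")
        params.append(error)

    # Timestamp columns use the SQL keyword CURRENT_TIMESTAMP directly, so
    # they add no "?" placeholder and nothing is appended to params here.
    if status == "processing":
        set_clauses.append("started_at = CURRENT_TIMESTAMP")
    elif status in ("completed", "failed"):
        set_clauses.append("completed_at = CURRENT_TIMESTAMP")

    params.append(queue_id)
    sql = f"UPDATE processing_queue SET {', '.join(set_clauses)} WHERE id = ?"
    conn.execute(sql, params)
    conn.commit()
```

The number of `?` placeholders must always equal the length of `params`. A bare `params.append()` raises the TypeError quoted in the commit message.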
-
Stefy Lanza (nextime / spora ) authored
- Add delete_queue_item function to database.py with ownership validation - Add delete_job method to QueueManager class - Add /job/<id>/delete endpoint in web.py with user authentication - Update history.html template to show delete button for queued jobs - Only allow users to delete their own jobs or admins to delete any job - Add confirmation dialog for job deletion
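A minimal sketch of a delete with ownership validation, assuming SQLite and a hypothetical `user_id` column. The real function's signature and schema are not shown in the commit message, so the parameters here are illustrative.

```python
import sqlite3


def delete_queue_item(conn, queue_id, user_id, is_admin=False):
    """Delete a queue row if the caller owns it or is an admin.

    Returns True if a row was deleted, False otherwise.
    """
    if is_admin:
        cursor = conn.execute(
            "DELETE FROM processing_queue WHERE id = ?", (queue_id,)
        )
    else:
        # Enforce ownership in the WHERE clause so the check and the delete
        # happen in a single statement.
        cursor = conn.execute(
            "DELETE FROM processing_queue WHERE id = ? AND user_id = ?",
            (queue_id, user_id),
        )
    conn.commit()
    return cursor.rowcount > 0
```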
-
Stefy Lanza (nextime / spora ) authored
- Make assign_job_with_model synchronous and return None to trigger local fallback - Remove asyncio from queue processing to avoid threading issues - Jobs now properly execute locally when no distributed workers are available - Maintain async file transfer infrastructure for future distributed worker support
-
Stefy Lanza (nextime / spora ) authored
- Convert assign_job_with_model and assign_job_to_worker to synchronous methods - Remove asyncio dependencies from queue processing - Simplify model transfer to avoid async websocket calls for now - Fix syntax errors in cluster_master.py
-
Stefy Lanza (nextime / spora ) authored
- Add VRAM requirement estimation for models - Implement intelligent worker selection based on VRAM, weight, and load - Add support for concurrent job execution with VRAM tracking - Integrate RunPod pod creation and management as fallback - Implement model loading/unloading logic to avoid redundant transfers - Add file transfer handling for shared storage and websocket fallback - Update queue system to use advanced cluster scheduling - Add job_id column to processing_queue table - Update web interface to submit jobs through queue system - Fix function name references in cluster master
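Worker selection based on VRAM, weight, and load might be sketched as follows. The `Worker` fields and the ranking order (weight first, then fewer active jobs, then more free VRAM) are assumptions made for illustration. The commit does not spell out how the three factors are combined.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    """Hypothetical view of a worker as seen by the scheduler."""

    name: str
    total_vram_gb: float
    used_vram_gb: float
    weight: int
    active_jobs: int

    @property
    def free_vram_gb(self):
        return self.total_vram_gb - self.used_vram_gb


def select_worker_for_job(workers, required_vram_gb):
    """Pick a worker with enough free VRAM, or None if none qualifies."""
    candidates = [w for w in workers if w.free_vram_gb >= required_vram_gb]
    if not candidates:
        return None  # caller may queue the job or fall back to local processing
    # Assumed ranking: higher weight, then lighter load, then more free VRAM.
    return max(
        candidates,
        key=lambda w: (w.weight, -w.active_jobs, w.free_vram_gb),
    )
```

Tracking `used_vram_gb` per worker is what allows several jobs to run concurrently on one GPU without exceeding its memory.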
-
Stefy Lanza (nextime / spora ) authored
Implement advanced job scheduling system with VRAM-aware worker selection, job queuing, RunPod integration, model management, and real-time notifications

- Add VRAM requirement determination for models
- Implement intelligent worker selection based on VRAM availability and weights
- Add job queuing when no workers available with automatic retry
- Integrate RunPod pod creation and management for scaling
- Implement model loading/unloading with reference counting
- Add file transfer support for remote workers (shared storage + websocket)
- Enable concurrent job processing with VRAM tracking
- Create job results page with detailed output display
- Add real-time job completion notifications with result links
- Update history page with live progress updates and result links
- Fix async handling in cluster master websocket communication
- Add database schema updates for job tracking
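Reference-counted model loading, as mentioned in this commit, could look roughly like the sketch below. `ModelRegistry`, `acquire`, and `release` are hypothetical names, and the `loader` and `unloader` callables stand in for whatever actually moves a model into and out of memory.

```python
class ModelRegistry:
    """Hypothetical registry that loads a model once and unloads it when idle."""

    def __init__(self, loader, unloader):
        self.loader = loader      # callable: model_path -> model object
        self.unloader = unloader  # callable: model object -> None
        self.models = {}          # model_path -> model object
        self.ref_counts = {}      # model_path -> number of active users

    def acquire(self, model_path):
        # Load only on first use; later jobs share the loaded model.
        if model_path not in self.models:
            self.models[model_path] = self.loader(model_path)
            self.ref_counts[model_path] = 0
        self.ref_counts[model_path] += 1
        return self.models[model_path]

    def release(self, model_path):
        if model_path not in self.ref_counts:
            raise KeyError(f"model not acquired: {model_path}")
        self.ref_counts[model_path] -= 1
        # Unload when the last job using the model has finished.
        if self.ref_counts[model_path] == 0:
            self.unloader(self.models.pop(model_path))
            del self.ref_counts[model_path]
```

With concurrent jobs sharing one model, this avoids both redundant loads and unloading a model that is still in use.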
-
Stefy Lanza (nextime / spora ) authored
- When only 1 video-capable model exists, include hidden input field - Form still submits the correct model_path value - Maintains proper form functionality while hiding UI when not needed
-