- 08 Oct, 2025 40 commits
-
Stefy Lanza (nextime / spora ) authored
- Remove active jobs older than 10 minutes to prevent accumulation - Cleans up leftover jobs from previous runs or crashes - Prevents duplicate job tracking issues
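The cleanup described above can be sketched as follows. This is a minimal illustration, not the project's code: the `prune_stale_jobs` name, the `started_at` field, and the dict shape of `active_jobs` are assumptions; only the 10-minute threshold comes from the commit message.

```python
import time

STALE_AFTER_SECONDS = 10 * 60  # "older than 10 minutes", per the commit message


def prune_stale_jobs(active_jobs, now=None, max_age=STALE_AFTER_SECONDS):
    """Drop entries whose start time is older than max_age seconds.

    active_jobs is assumed to map job_id -> {"started_at": <epoch seconds>}.
    Returns the ids that were removed.
    """
    now = time.time() if now is None else now
    # Collect ids first so the dict is not mutated while iterating over it.
    stale = [job_id for job_id, info in active_jobs.items()
             if now - info["started_at"] > max_age]
    for job_id in stale:
        del active_jobs[job_id]
    return stale
```

Calling this at the start of each management-loop pass would discard jobs left over from a previous run or crash before new assignments are made.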
-
Stefy Lanza (nextime / spora ) authored
- Management loop only completes cancelled jobs, doesn't send duplicate cancel messages - cancel_job sends the cancel command, management loop cleans up - Prevents duplicate cancellation logs
-
Stefy Lanza (nextime / spora ) authored
- Check if job is already assigned before assigning again - Prevents multiple active jobs for the same queue entry - Fixes duplicate cancellation attempts
-
Stefy Lanza (nextime / spora ) authored
- Pass queue_id to workers for proper cancellation detection - Workers now check correct job id for cancellation status - Workers receive effective stop commands via database polling
-
Stefy Lanza (nextime / spora ) authored
- Workers receive stop processing commands when jobs are cancelled - Ensures workers halt processing immediately on cancel - Maintains proper cleanup of resources
-
Stefy Lanza (nextime / spora ) authored
- Workers are freed immediately when jobs are cancelled - Clean up active jobs in cluster master when cancelling processing jobs - Remove unnecessary cleanup from restart (handled by cancel)
-
Stefy Lanza (nextime / spora ) authored
- Implements job cancellation by notifying workers - Sends cancel messages to local backend or remote clients - Cleans up cancelled job resources
-
Stefy Lanza (nextime / spora ) authored
- Pass queue_id to _assign_local_job method - Fix NameError when assigning local jobs
-
Stefy Lanza (nextime / spora ) authored
- Store queue_id in active_jobs tracking - Properly detect cancelled jobs by checking queue status - Clean up worker resources when jobs are cancelled - Workers become available for new jobs after cancellation
-
Stefy Lanza (nextime / spora ) authored
- Cluster master sends cancel_job messages to backend/client when jobs are cancelled - Add _handle_cancel_job to process cancel confirmations from clients - Workers can be notified faster to stop processing and free resources
-
Stefy Lanza (nextime / spora ) authored
- Restarted jobs set to 'queued' status - Cluster master looks for 'queued' jobs, sets to 'processing', then assigns - Proper job lifecycle: queued -> processing -> assigned -> completed/failed
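The lifecycle named above (queued -> processing -> assigned -> completed/failed) can be expressed as a small transition table. This is a hedged sketch: `ALLOWED_TRANSITIONS` and `advance` are hypothetical names, and the `cancelled` transitions are an assumption drawn from the cancel and restart commits in this log rather than from this commit.

```python
# Status names come from the commit messages; the table itself is illustrative.
ALLOWED_TRANSITIONS = {
    "queued": {"processing", "cancelled"},
    "processing": {"assigned", "cancelled"},
    "assigned": {"completed", "failed", "cancelled"},
    "cancelled": {"queued"},  # assumption: restart resets a cancelled job to queued
    "completed": set(),
    "failed": set(),
}


def advance(status, new_status):
    """Return new_status if the move is allowed, otherwise raise ValueError."""
    if new_status not in ALLOWED_TRANSITIONS.get(status, set()):
        raise ValueError(f"illegal transition: {status} -> {new_status}")
    return new_status
```

Rejecting any other move makes a skipped step, such as queued going straight to completed, fail loudly instead of leaving a job in an inconsistent state.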
-
Stefy Lanza (nextime / spora ) authored
- Restarted jobs now set to 'processing' with empty job_id - Cluster master will pick up restarted jobs for assignment
-
Stefy Lanza (nextime / spora ) authored
- QueueManager now only handles job submission and management - Job processing is handled exclusively by cluster master - Eliminates duplicate queue processing between web and cluster processes
-
Stefy Lanza (nextime / spora ) authored
- Show when worker connects to backend - Show when worker registers - Help debug why jobs aren't being received
-
Stefy Lanza (nextime / spora ) authored
- Queue manager marks jobs as processing, cluster master assigns them - Changed query back to look for processing jobs without job_id
-
Stefy Lanza (nextime / spora ) authored
- Always use cluster master for job assignment, even for local jobs - Remove separate local processing path - Local processes treated same as remote, except for auto weight adjustment
-
Stefy Lanza (nextime / spora ) authored
- Allow local jobs to start even if worker socket check fails - Backend handles worker availability, queue manager should always allow local processing
-
Stefy Lanza (nextime / spora ) authored
- Add restart button in job history for cancelled jobs - Add /job/<id>/restart route in web interface - Add restart_job method in QueueManager to reset cancelled jobs to queued
-
Stefy Lanza (nextime / spora ) authored
- Worker now prints when receiving jobs and sending results - Cluster master uses TCP polling for consistency with clients
-
Stefy Lanza (nextime / spora ) authored
- Use get_queue_by_job_id to check job status - More reliable than TCP polling for local jobs
-
Stefy Lanza (nextime / spora ) authored
- Initialize self.pending_jobs dict for job monitoring tasks - Fixes AttributeError when assigning local jobs
-
Stefy Lanza (nextime / spora ) authored
- Use 'analyze_request' instead of 'analysis_request' - Match the expected message type in worker processes
-
Stefy Lanza (nextime / spora ) authored
- Change response.get('msg_type') to response.msg_type - Change response.get('data') to response.data - Message objects don't have get method, use attributes instead
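To illustrate why the `.get` calls failed, here is a stand-in for the message type. The real `Message` class is not shown in this log, so the dataclass below and the `analyze_response` value are assumptions; only the `msg_type` and `data` attribute names come from the commit message.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Message:
    """Hypothetical stand-in for the project's Message object."""
    msg_type: str
    data: Dict[str, Any] = field(default_factory=dict)


response = Message(msg_type="analyze_response", data={"ok": True})

# A plain object has no dict-style get(); calling it raises AttributeError.
has_get = hasattr(response, "get")

# Attribute access is what the fix switches to.
kind = response.msg_type
payload = response.data
```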
-
Stefy Lanza (nextime / spora ) authored
- If --weight is not specified, master weight changes to 0 when clients connect - If --weight is specified, master participates in job selection with that weight
-
Stefy Lanza (nextime / spora ) authored
- Cluster master now participates in job selection even when clients are connected - Local workers compete with external workers based on weight and VRAM
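One way local and external workers could compete on weight and VRAM is sketched below. The actual selection rule is not given in the commit message, so the "highest weight wins, VRAM breaks ties" policy, the `pick_worker` name, and the worker dict fields are all assumptions.

```python
def pick_worker(workers, required_vram_gb):
    """Pick the eligible worker with the highest weight.

    workers is assumed to be a list of dicts with 'id', 'weight' and 'vram_gb'.
    A worker is eligible when its weight is positive and it has enough VRAM.
    Returns the chosen worker dict, or None if nothing qualifies.
    """
    eligible = [w for w in workers
                if w["weight"] > 0 and w["vram_gb"] >= required_vram_gb]
    if not eligible:
        return None
    # Highest weight first; larger VRAM is used only as a tie-breaker.
    return max(eligible, key=lambda w: (w["weight"], w["vram_gb"]))
```

Under this policy a master with weight 0 never receives jobs, which is consistent with the commits that set the master weight to 0 when clients connect and no `--weight` is given.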
-
Stefy Lanza (nextime / spora ) authored
- Jobs are inserted as 'queued', not 'processing' - Cluster master now finds and assigns queued jobs
-
Stefy Lanza (nextime / spora ) authored
- Shows total VRAM detected on local GPUs when registering processes
-
Stefy Lanza (nextime / spora ) authored
- Local workers now require sufficient VRAM like other workers - Since the server has 24GB VRAM and jobs need 16GB, the check passes normally
-
Stefy Lanza (nextime / spora ) authored
- Local jobs now monitor for completion and handle results - Prevents jobs from hanging without result retrieval
-
Stefy Lanza (nextime / spora ) authored
- Changed local client ID to 'local' and marked as local to prevent cleanup - Local clients are not cleaned up after 60 seconds - Prevents 'Client local disconnected' messages
-
Stefy Lanza (nextime / spora ) authored
- Register local processes in cluster master when weight > 0 - Handle local job assignment by forwarding to backend via TCP - Allows jobs to run locally when no cluster clients are connected
-
Stefy Lanza (nextime / spora ) authored
- Modified cluster client to connect to backend's TCP web port instead of worker Unix socket - Backend acts as proper bridge: web interface (TCP) ↔ workers (Unix socket) - Cluster client now communicates with backend the same way as web interface - This fixes the timeout issue and ensures proper job flow through the backend
-
Stefy Lanza (nextime / spora ) authored
- Added clean_queue API endpoint in web.py for admin users - Added clean_queue database function to delete all queued/processing jobs - Added Clean Queue button to admin dashboard template - Button is only visible to admin users and allows clearing stuck jobs
-
Stefy Lanza (nextime / spora ) authored
- Removed backend process startup from cluster_client.py since vidai.py already starts it for client mode - This prevents 'Address already in use' error when running as cluster client - Cluster client now only manages worker processes, not the backend
-
Stefy Lanza (nextime / spora ) authored
- Modified cluster client to start a local backend process alongside workers - Backend process handles communication between cluster client and local workers - Fixed process cleanup to properly terminate backend and worker processes - This resolves the timeout issue when cluster client forwards jobs to local backend
-
Stefy Lanza (nextime / spora ) authored
- Modified queue.py to allow retried jobs to use distributed processing when available - Fixed async coroutine warning by adding await to _transfer_job_files call - Jobs that fail on clients will now be properly re-queued for distributed processing instead of falling back to local workers that may not exist
-
Stefy Lanza (nextime / spora ) authored
- Made assign_job_to_worker, _transfer_job_files, _transfer_file_via_websocket, enable_process, disable_process, update_process_weight, restart_client_workers, and restart_client_worker async methods - Added proper exception handling for websocket send operations - When websocket send fails due to broken connection, clients are now properly removed from available workers selection - This ensures that disconnected clients are immediately removed from the worker pool and jobs are re-assigned to available workers
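The failure handling described above can be sketched with `asyncio` and a fake socket. Everything named here (`FakeSocket`, `assign_job`, the `clients` dict) is hypothetical. A real websocket library raises its own connection-closed exception, which would need to be caught in place of `OSError`.

```python
import asyncio


class FakeSocket:
    """Stand-in for a websocket connection; send() fails when broken."""

    def __init__(self, broken=False):
        self.broken = broken
        self.sent = []

    async def send(self, payload):
        if self.broken:
            raise ConnectionError("connection closed")
        self.sent.append(payload)


async def assign_job(clients, job):
    """Try each client in turn; drop any whose send fails.

    Returns the id of the client that accepted the job, or None.
    """
    for client_id in list(clients):  # copy keys: the dict is mutated below
        try:
            await clients[client_id].send(job)
            return client_id
        except OSError:  # ConnectionError is a subclass of OSError
            del clients[client_id]  # remove the dead client from the pool
    return None
```

Because the failed client is removed before the next one is tried, the job is re-assigned in the same call rather than waiting for a later cleanup pass.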
-
Stefy Lanza (nextime / spora ) authored
- Fixed cluster_client.py to send proper Message objects instead of dicts to backend_comm.send_message() - Modified queue.py to prevent failed jobs from being immediately re-assigned to distributed processing - Jobs with retry_count > 0 now use local processing to avoid loops with failing distributed workers
-
Stefy Lanza (nextime / spora ) authored
- Added last_status_print timestamp to QueueManager class - Modified _process_queue to only print job status messages once every 10 seconds - This prevents console spam from the queue manager when jobs are waiting for workers
-
Stefy Lanza (nextime / spora ) authored
- Calculate should_print_status once per loop iteration instead of updating timestamp inside the loop - This ensures consistent rate limiting where all job status messages are either printed together or not at all
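The rate limiting described in this commit and the previous one can be sketched as below. `last_status_print` and `should_print_status` are names from the commit messages; the `StatusPrinter` class, its `process` method, and the message text are assumptions made for illustration.

```python
import time


class StatusPrinter:
    """Prints waiting-job messages at most once per interval."""

    def __init__(self, interval=10.0):
        self.interval = interval
        self.last_status_print = 0.0

    def process(self, waiting_jobs, now=None):
        now = time.time() if now is None else now
        # Decide once per pass, so all jobs print together or not at all.
        should_print_status = now - self.last_status_print >= self.interval
        printed = []
        for job in waiting_jobs:
            if should_print_status:
                message = f"Job {job} waiting for available workers"
                print(message)
                printed.append(message)
        # Update the timestamp after the loop, never inside it.
        if should_print_status:
            self.last_status_print = now
        return printed
```

Updating the timestamp inside the loop would let only the first waiting job print and silence the rest, which is the inconsistency this commit removes.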
-