- 08 Oct, 2025 (40 commits)
-
Stefy Lanza (nextime / spora ) authored
- Local workers now require sufficient VRAM like other workers - Since the server has 24GB VRAM and jobs need 16GB, the check passes normally
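A minimal sketch of what a VRAM sufficiency check like this could look like, assuming a CUDA-capable setup; the helper names and the `JOB_VRAM_REQUIREMENT_GB` constant are illustrative, not taken from the repository:

```python
# Hypothetical sketch of a VRAM sufficiency check for local workers.
# Assumes torch is available; reports 0 GB when no CUDA device exists.
import torch

JOB_VRAM_REQUIREMENT_GB = 16  # assumed per-job requirement from the commit message

def get_available_vram_gb(device_index: int = 0) -> float:
    """Return total VRAM of the given CUDA device in GB, or 0 if unavailable."""
    if not torch.cuda.is_available():
        return 0.0
    props = torch.cuda.get_device_properties(device_index)
    return props.total_memory / (1024 ** 3)

def local_worker_has_enough_vram(required_gb: float = JOB_VRAM_REQUIREMENT_GB) -> bool:
    """Apply the same sufficiency check to local workers as to remote ones."""
    return get_available_vram_gb() >= required_gb

if __name__ == "__main__":
    # On a 24 GB GPU this prints True for the 16 GB requirement described above.
    print(local_worker_has_enough_vram())
```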
-
Stefy Lanza (nextime / spora ) authored
- Local jobs now monitor for completion and handle results - Prevents jobs from hanging without result retrieval
-
Stefy Lanza (nextime / spora ) authored
- Changed local client ID to 'local' and marked as local to prevent cleanup - Local clients are not cleaned up after 60 seconds - Prevents 'Client local disconnected' messages
-
Stefy Lanza (nextime / spora ) authored
- Register local processes in cluster master when weight > 0 - Handle local job assignment by forwarding to backend via TCP - Allows jobs to run locally when no cluster clients are connected
-
Stefy Lanza (nextime / spora ) authored
- Modified cluster client to connect to backend's TCP web port instead of worker Unix socket - Backend acts as proper bridge: web interface (TCP) ↔ workers (Unix socket) - Cluster client now communicates with backend the same way as web interface - This fixes the timeout issue and ensures proper job flow through the backend
Stefy Lanza (nextime / spora ) authored
- Added clean_queue API endpoint in web.py for admin users - Added clean_queue database function to delete all queued/processing jobs - Added Clean Queue button to admin dashboard template - Button is only visible to admin users and allows clearing stuck jobs
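A hedged sketch of the two clean_queue pieces described here, written Flask-style with sqlite3; the table name `queue`, its `status` column, the database path, and the `is_admin` session flag are assumptions for illustration:

```python
# Illustrative sketch only: assumes a Flask app, a sqlite3 database at DB_PATH,
# and a `queue` table with a `status` column; none of these names are confirmed.
import sqlite3
from flask import Flask, jsonify, session

app = Flask(__name__)
DB_PATH = "vidai.db"  # assumed path

def clean_queue() -> int:
    """Delete all jobs still marked queued or processing; return the row count."""
    with sqlite3.connect(DB_PATH) as conn:
        cur = conn.execute(
            "DELETE FROM queue WHERE status IN ('queued', 'processing')"
        )
        return cur.rowcount

@app.route("/api/clean_queue", methods=["POST"])
def clean_queue_endpoint():
    # Only admins may clear the queue, mirroring the button's visibility rule.
    if not session.get("is_admin"):
        return jsonify({"error": "admin only"}), 403
    removed = clean_queue()
    return jsonify({"removed": removed})
```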
-
Stefy Lanza (nextime / spora ) authored
- Removed backend process startup from cluster_client.py since vidai.py already starts it for client mode - This prevents 'Address already in use' error when running as cluster client - Cluster client now only manages worker processes, not the backend
-
Stefy Lanza (nextime / spora ) authored
- Modified cluster client to start a local backend process alongside workers - Backend process handles communication between cluster client and local workers - Fixed process cleanup to properly terminate backend and worker processes - This resolves the timeout issue when cluster client forwards jobs to local backend
-
Stefy Lanza (nextime / spora ) authored
- Modified queue.py to allow retried jobs to use distributed processing when available - Fixed async coroutine warning by adding await to _transfer_job_files call - Jobs that fail on clients will now be properly re-queued for distributed processing instead of falling back to local workers that may not exist
-
Stefy Lanza (nextime / spora ) authored
- Made assign_job_to_worker, _transfer_job_files, _transfer_file_via_websocket, enable_process, disable_process, update_process_weight, restart_client_workers, and restart_client_worker async methods
- Added proper exception handling for websocket send operations
- When a websocket send fails due to a broken connection, the client is now removed from the available workers selection
- This ensures that disconnected clients are immediately removed from the worker pool and jobs are re-assigned to available workers
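A minimal sketch of the failure-aware async send this describes, assuming the `websockets` package and a simple in-memory registry; the `clients` and `available_workers` names are illustrative:

```python
# Sketch of failure-aware async sending; assumes the `websockets` package and
# a `clients` dict mapping client_id -> websocket connection. Names are assumed.
import json
from websockets.exceptions import ConnectionClosed

clients: dict[str, object] = {}
available_workers: set[str] = set()

async def send_to_client(client_id: str, payload: dict) -> bool:
    """Send a message to a client; drop it from the worker pool on failure."""
    ws = clients.get(client_id)
    if ws is None:
        return False
    try:
        await ws.send(json.dumps(payload))
        return True
    except ConnectionClosed:
        # Broken connection: stop offering this client for new jobs.
        clients.pop(client_id, None)
        available_workers.discard(client_id)
        return False

async def assign_job_to_worker(client_id: str, job: dict) -> bool:
    """Async assignment; the caller re-queues the job if the send fails."""
    return await send_to_client(client_id, {"type": "job_assign", "job": job})
```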
-
Stefy Lanza (nextime / spora ) authored
- Fixed cluster_client.py to send proper Message objects instead of dicts to backend_comm.send_message() - Modified queue.py to prevent failed jobs from being immediately re-assigned to distributed processing - Jobs with retry_count > 0 now use local processing to avoid loops with failing distributed workers
-
Stefy Lanza (nextime / spora ) authored
- Added last_status_print timestamp to QueueManager class - Modified _process_queue to only print job status messages once every 10 seconds - This prevents console spam from the queue manager when jobs are waiting for workers
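A small sketch of the rate-limiting idea; the 10-second interval comes from the commit message, while the class shape and attribute names are assumptions:

```python
# Illustrative rate limiter for queue status prints; class shape is assumed.
import time

class QueueManager:
    STATUS_PRINT_INTERVAL = 10.0  # seconds, per the commit message

    def __init__(self) -> None:
        self.last_status_print = 0.0

    def _process_queue(self, pending_jobs: list[dict]) -> None:
        # Decide once per loop iteration whether status lines may be printed,
        # so all messages in one pass are printed together or not at all.
        now = time.monotonic()
        should_print_status = now - self.last_status_print >= self.STATUS_PRINT_INTERVAL
        if should_print_status:
            self.last_status_print = now
        for job in pending_jobs:
            if should_print_status:
                print(f"Job {job.get('id')} waiting for a worker")

if __name__ == "__main__":
    qm = QueueManager()
    qm._process_queue([{"id": 1}, {"id": 2}])
```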
-
Stefy Lanza (nextime / spora ) authored
- Calculate should_print_status once per loop iteration instead of updating timestamp inside the loop - This ensures consistent rate limiting where all job status messages are either printed together or not at all
-
Stefy Lanza (nextime / spora ) authored
- Added last_job_status_print timestamp to ClusterMaster class - Modified _management_loop to only print job status messages once every 10 seconds - This prevents console spam when jobs are waiting for workers
-
Stefy Lanza (nextime / spora ) authored
- Added consecutive_failures and failing flags to client tracking - Increment failure counter on job failures, reset on success - Mark clients as failing after 3 consecutive failures - Exclude failing clients from worker selection in all methods - Reset failure tracking when clients reconnect - This prevents problematic clients from receiving jobs until they reconnect
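A rough sketch of the failure tracking described above; the threshold of 3 comes from the commit message, while the `ClientState` dataclass and registry names are assumptions used for illustration:

```python
# Sketch of per-client failure tracking; dataclass and field names are assumed.
from dataclasses import dataclass

FAILURE_THRESHOLD = 3  # from the commit message

@dataclass
class ClientState:
    client_id: str
    consecutive_failures: int = 0
    failing: bool = False

clients: dict[str, ClientState] = {}

def record_job_result(client_id: str, success: bool) -> None:
    """Update counters; mark the client failing after 3 consecutive failures."""
    state = clients.setdefault(client_id, ClientState(client_id))
    if success:
        state.consecutive_failures = 0
        state.failing = False
    else:
        state.consecutive_failures += 1
        if state.consecutive_failures >= FAILURE_THRESHOLD:
            state.failing = True

def on_client_reconnect(client_id: str) -> None:
    """Reconnecting resets the tracking so the client may receive jobs again."""
    clients[client_id] = ClientState(client_id)

def selectable_clients() -> list[str]:
    """Worker selection skips clients currently marked as failing."""
    return [cid for cid, s in clients.items() if not s.failing]
```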
-
Stefy Lanza (nextime / spora ) authored
- Added missing import os at the top of vidai/cluster_client.py - Removed redundant local import os in _handle_model_transfer_complete function - This fixes the error when handling master commands on client receiving a job
-
Stefy Lanza (nextime / spora ) authored
- Added delete button next to 'View Result' for completed jobs in history.html - Button appears only for completed jobs and includes confirmation dialog - Uses existing /job/{job_id}/delete route which already handles ownership checks - Maintains consistent styling with other action buttons - Users can now clean up their completed job history by deleting individual jobs they no longer need.
-
Stefy Lanza (nextime / spora ) authored
- Enhanced cluster_master.select_worker_for_job() with more robust GPU detection:
  - Added flexible GPU info parsing with fallbacks
  - Support for incomplete GPU info structures
  - Allow CPU workers as fallback when no GPU workers are available
  - Added detailed debug logging for troubleshooting worker selection
- Fixed queue._execute_local_job() to properly poll for backend results:
  - Changed from simulated processing to actual result polling
  - Added timeout handling (10 minutes max)
  - Proper error handling for failed jobs
- Simplified backend.handle_web_message() to use local worker routing:
  - Removed async cluster master calls that were failing
  - Use direct worker socket communication for local processing
- These changes should resolve the 'No suitable distributed worker' issue and make local processing work properly. The system now properly detects GPU workers, falls back to CPU workers if needed, and correctly processes jobs locally when distributed workers aren't available.
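A hedged sketch of the result-polling idea in _execute_local_job; the 10-minute timeout is taken from the commit message, while `backend_get_result` and the poll interval are illustrative stand-ins for whatever the backend actually exposes:

```python
# Sketch only: polls a hypothetical backend for a job result with a hard timeout.
# `backend_get_result` stands in for the real backend call and is not a real API.
import time
from typing import Optional

RESULT_TIMEOUT_SECONDS = 10 * 60  # 10 minutes, per the commit message
POLL_INTERVAL_SECONDS = 2.0       # assumed

def backend_get_result(job_id: str) -> Optional[dict]:
    """Placeholder for querying the backend; returns None while still running."""
    return None

def execute_local_job(job_id: str) -> dict:
    """Poll the backend until a result arrives or the timeout expires."""
    deadline = time.monotonic() + RESULT_TIMEOUT_SECONDS
    while time.monotonic() < deadline:
        result = backend_get_result(job_id)
        if result is not None:
            if result.get("status") == "failed":
                raise RuntimeError(result.get("error", "job failed"))
            return result
        time.sleep(POLL_INTERVAL_SECONDS)
    raise TimeoutError(f"Job {job_id} did not finish within 10 minutes")
```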
-
Stefy Lanza (nextime / spora ) authored
- Fixed JavaScript error in analyze.html: 'data.result.substring is not a function' by checking if data.result is a string before calling substring, converting objects to JSON string if needed - Added debug logging to cluster_master.select_worker_for_job() to diagnose why no distributed workers are found when GPU clients are connected - Debug logs show available processes and process queue to help identify registration issues - This should resolve the JavaScript console error and help debug why cluster workers aren't being selected for jobs.
-
Stefy Lanza (nextime / spora ) authored
- Added cancel_job method to QueueManager for cancelling running jobs
- Added /job/<id>/cancel route in web.py for cancelling jobs via POST
- Updated history.html template to show:
  - Cancel button for processing jobs (orange button)
  - Delete button for cancelled jobs (red button)
  - Cancelled status styling (gray background)
- Added JavaScript updateJobActions function for dynamic action updates
- Modified worker_analysis.py to check for job cancellation during processing:
  - Added check_job_cancelled function to query database
  - Modified analyze_media to check cancellation before each frame and summary
  - Workers now stop processing and return 'Job cancelled by user' message
- Updated queue.py to pass job_id in data sent to workers for cancellation checking
- Job cancellation works for both local and distributed workers
- Users can now cancel running analysis jobs from the history page, and cancelled jobs can be deleted from history.
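A minimal sketch of the cooperative cancellation check described above, assuming a sqlite3 `queue` table with `id` and `status` columns; the schema and database path are assumptions, not taken from the repository:

```python
# Sketch of a cooperative cancellation check; the schema is assumed. The worker
# calls this before each expensive step and bails out early when cancelled.
import sqlite3

DB_PATH = "vidai.db"  # assumed

def check_job_cancelled(job_id: int) -> bool:
    """Return True if the job's status in the database is 'cancelled'."""
    with sqlite3.connect(DB_PATH) as conn:
        row = conn.execute(
            "SELECT status FROM queue WHERE id = ?", (job_id,)
        ).fetchone()
    return bool(row) and row[0] == "cancelled"

def analyze_media(job_id: int, frames: list) -> str:
    """Check for cancellation before each frame and before the summary step."""
    for frame in frames:
        if check_job_cancelled(job_id):
            return "Job cancelled by user"
        # ... per-frame analysis would run here ...
    if check_job_cancelled(job_id):
        return "Job cancelled by user"
    return "analysis complete"
```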
-
Stefy Lanza (nextime / spora ) authored
- Made transformers import conditional in models.py to avoid import errors when not installed - Fixed update_queue_status call to use 'error' parameter instead of 'error_message' - Added checks for transformers availability before using it in model loading - This resolves the ModuleNotFoundError and TypeError when running jobs - The system can now handle job scheduling even when the transformers library is not available, and properly reports errors when job execution fails.
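A common pattern for the conditional import described here, as a sketch; which transformers symbols the real models.py imports is not confirmed, so AutoModel and AutoTokenizer are used as examples:

```python
# Sketch of a guarded optional dependency; AutoModel/AutoTokenizer are example
# symbols, not necessarily the ones models.py actually uses.
try:
    from transformers import AutoModel, AutoTokenizer
    TRANSFORMERS_AVAILABLE = True
except ImportError:
    AutoModel = AutoTokenizer = None
    TRANSFORMERS_AVAILABLE = False

def load_model(model_name: str):
    """Fail with a clear error instead of a ModuleNotFoundError at import time."""
    if not TRANSFORMERS_AVAILABLE:
        raise RuntimeError(
            "transformers is not installed; install it to load models"
        )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    return model, tokenizer
```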
-
Stefy Lanza (nextime / spora ) authored
- Added get_queue_status to the database imports in web.py - This fixes the NameError when accessing /api/job_status/<job_id> - The job status API endpoint now works correctly for real-time job monitoring
-
Stefy Lanza (nextime / spora ) authored
- Changed analyze route to stay on page after job submission instead of redirecting to history
- Added submitted_job parameter to template to track current job
- Modified sidebar to show for all users (not just admins)
- Added job progress section in sidebar that displays:
  - Job ID and status (queued/processing/completed/failed)
  - Tokens used
  - Result preview
- Added /api/job_status/<job_id> endpoint for real-time job status
- Added JavaScript polling for job status updates every 2 seconds
- Job progress updates automatically without page refresh
- Users can see their analysis job progress in real-time in the sidebar
- The analyze page now provides immediate feedback and progress tracking instead of requiring navigation to the history page.
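A hedged sketch of a status endpoint like the one described, written Flask-style; the `get_queue_status` helper and the fields it returns are assumptions for illustration:

```python
# Sketch only: a Flask endpoint mirroring /api/job_status/<job_id>. The helper
# get_queue_status and its fields (status, tokens, result) are assumed names.
from flask import Flask, jsonify

app = Flask(__name__)

def get_queue_status(job_id: int) -> dict | None:
    """Placeholder for the database lookup used by the real endpoint."""
    return {"status": "processing", "tokens": 0, "result": None}

@app.route("/api/job_status/<int:job_id>")
def job_status(job_id: int):
    job = get_queue_status(job_id)
    if job is None:
        return jsonify({"error": "not found"}), 404
    # The analyze page polls this endpoint every 2 seconds and updates the sidebar.
    return jsonify({
        "job_id": job_id,
        "status": job["status"],
        "tokens": job.get("tokens", 0),
        "result": job.get("result"),
    })
```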
-
Stefy Lanza (nextime / spora ) authored
- Fixed process type mapping in queue manager ('analyze' -> 'analysis', 'train' -> 'training')
- Implemented actual job sending in cluster master assign_job_to_worker()
- Modified cluster client to forward jobs to local backend and monitor results
- Added result polling mechanism for cluster jobs
- Jobs should now execute on connected cluster workers instead of remaining queued
- The issue was that jobs were being assigned but never sent to workers. Now:
  1. Queue manager selects worker using VRAM-aware logic
  2. Cluster master assigns job and sends it via websocket
  3. Cluster client receives job and forwards to local backend
  4. Cluster client polls backend for results and sends back to master
  5. Results are properly returned to web interface
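The type mapping fix itself is small; a sketch of what it plausibly looks like (the dict and function names are illustrative):

```python
# Illustrative mapping between job types used by the web layer and the worker
# process types expected by the cluster; the dict name is an assumption.
PROCESS_TYPE_MAP = {
    "analyze": "analysis",
    "train": "training",
}

def to_process_type(job_type: str) -> str:
    """Translate a queued job's type into the worker process type."""
    return PROCESS_TYPE_MAP.get(job_type, job_type)
```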
-
Stefy Lanza (nextime / spora ) authored
- Updated queue manager to use select_worker_for_job() and assign_job_to_worker() instead of unimplemented assign_job_with_model() - Now properly implements VRAM-aware worker selection based on model requirements - Jobs will be assigned to distributed workers when available with sufficient VRAM - Falls back to local processing when no suitable distributed worker is found - Added proper error handling and logging for job assignment process
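A rough sketch of VRAM-aware selection with a local fallback; the worker record shape and the `vram_gb` field are assumptions:

```python
# Sketch of VRAM-aware worker selection; the worker record shape is assumed.
from typing import Optional

def select_worker_for_job(workers: list[dict], required_vram_gb: float) -> Optional[dict]:
    """Pick the worker with the most VRAM that still meets the requirement."""
    eligible = [w for w in workers if w.get("vram_gb", 0) >= required_vram_gb]
    if not eligible:
        return None  # caller falls back to local processing
    return max(eligible, key=lambda w: w["vram_gb"])

if __name__ == "__main__":
    pool = [{"id": "a", "vram_gb": 8}, {"id": "b", "vram_gb": 24}]
    print(select_worker_for_job(pool, 16))  # -> worker "b"
    print(select_worker_for_job(pool, 48))  # -> None, local fallback
```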
-
Stefy Lanza (nextime / spora ) authored
- Fixed 'list.append() takes exactly one argument (0 given)' error in update_queue_status - Removed empty params.append() calls for timestamp fields that use CURRENT_TIMESTAMP directly - Queue processing should now work correctly without errors
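A sketch of the pattern behind this fix: building a dynamic UPDATE where fields set to CURRENT_TIMESTAMP contribute no bind parameter, so nothing gets appended to params for them. The table and column names are assumptions:

```python
# Sketch of building an UPDATE statement dynamically; table/column names are
# assumed. CURRENT_TIMESTAMP fields take no bind parameter, so params.append()
# is never called with zero arguments for them (the original bug).
import sqlite3

def update_queue_status(conn: sqlite3.Connection, job_id: int,
                        status: str, error: str | None = None) -> None:
    sets = ["status = ?"]
    params: list = [status]
    if error is not None:
        sets.append("error = ?")
        params.append(error)
    if status in ("completed", "failed"):
        sets.append("finished_at = CURRENT_TIMESTAMP")  # no parameter needed
    params.append(job_id)
    conn.execute(f"UPDATE queue SET {', '.join(sets)} WHERE id = ?", params)
    conn.commit()
```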
-
Stefy Lanza (nextime / spora ) authored
- Add delete_queue_item function to database.py with ownership validation - Add delete_job method to QueueManager class - Add /job/<id>/delete endpoint in web.py with user authentication - Update history.html template to show delete button for queued jobs - Only allow users to delete their own jobs or admins to delete any job - Add confirmation dialog for job deletion
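A minimal sketch of ownership-validated deletion; the `queue` table, its `user_id` column, the database path, and the admin flag are assumptions:

```python
# Sketch of delete_queue_item with an ownership check; schema names are assumed.
import sqlite3

DB_PATH = "vidai.db"  # assumed

def delete_queue_item(job_id: int, user_id: int, is_admin: bool) -> bool:
    """Delete a job only if the requester owns it or is an admin."""
    with sqlite3.connect(DB_PATH) as conn:
        if is_admin:
            cur = conn.execute("DELETE FROM queue WHERE id = ?", (job_id,))
        else:
            cur = conn.execute(
                "DELETE FROM queue WHERE id = ? AND user_id = ?",
                (job_id, user_id),
            )
        return cur.rowcount > 0
```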
-