Commits · 09a3589ac56ab92dd427e22cecdf4ede10fd0dcc · SexHackMe / vidai

08 Oct, 2025 40 commits

Change local job monitoring to poll database instead of TCP · 09a3589a
Stefy Lanza (nextime / spora ) authored Oct 08, 2025
```
- Use get_queue_by_job_id to check job status
- More reliable than TCP polling for local jobs
```
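
A minimal sketch of what database polling for a local job might look like. The commit names only `get_queue_by_job_id`; its signature, the `queue` table layout, the SQLite backend, and the set of terminal statuses are assumptions made for illustration.

```python
import sqlite3
import time

TERMINAL_STATUSES = {"completed", "failed"}  # assumed terminal status names


def get_queue_by_job_id(conn, job_id):
    """Hypothetical stand-in: return the queue row for job_id as a dict, or None."""
    row = conn.execute(
        "SELECT job_id, status, result FROM queue WHERE job_id = ?", (job_id,)
    ).fetchone()
    if row is None:
        return None
    return {"job_id": row[0], "status": row[1], "result": row[2]}


def wait_for_job(conn, job_id, timeout=600.0, interval=1.0, sleep=time.sleep):
    """Poll the database until the job reaches a terminal status or times out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        entry = get_queue_by_job_id(conn, job_id)
        if entry is not None and entry["status"] in TERMINAL_STATUSES:
            return entry
        sleep(interval)
    return None  # timed out without seeing a terminal status


# In-memory table used only to exercise the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue (job_id TEXT PRIMARY KEY, status TEXT, result TEXT)")
conn.execute("INSERT INTO queue VALUES ('job-1', 'completed', 'ok')")
conn.execute("INSERT INTO queue VALUES ('job-2', 'processing', NULL)")
```

Reading the status from the shared database avoids depending on a live socket for jobs that run in the same process tree, which is presumably why the commit calls it more reliable.
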
Add missing pending_jobs attribute to ClusterMaster · 08a9a99e
Stefy Lanza (nextime / spora ) authored Oct 08, 2025
```
- Initialize self.pending_jobs dict for job monitoring tasks
- Fixes AttributeError when assigning local jobs
```
Fix message type for local job assignment · 42837adf
Stefy Lanza (nextime / spora ) authored Oct 08, 2025
```
- Use 'analyze_request' instead of 'analysis_request'
- Match the expected message type in worker processes
```

Fix Message object attribute access in job monitoring · fb38da16

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Change response.get('msg_type') to response.msg_type
- Change response.get('data') to response.data
- Message objects don't have a get method; use attributes instead
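
The difference can be illustrated with a small stand-in class. The `Message` dataclass below is hypothetical (the project's real class is not shown in this log), and the `analyze_response` type string is made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Message:
    """Hypothetical stand-in for the project's Message class."""
    msg_type: str
    data: Dict[str, Any] = field(default_factory=dict)


def describe(response: Message) -> str:
    # Attribute access works on the object; response.get('msg_type') would
    # raise AttributeError because Message is not a dict.
    return f"{response.msg_type}: {sorted(response.data)}"


# 'analyze_response' is an illustrative type name, not taken from the source.
response = Message(msg_type="analyze_response", data={"job_id": "job-1"})
summary = describe(response)
has_get = hasattr(response, "get")
```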


Restore automatic weight adjustment only when weight is not explicitly set · 384c96f5

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- If --weight is not specified, master weight changes to 0 when clients connect
- If --weight is specified, master participates in job selection with that weight
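
One way to tell an omitted flag from an explicit value is a `None` default in `argparse`. This is a sketch of the rule described above, not the project's code; `effective_master_weight` and the default weight of 100 are assumptions.

```python
import argparse


def effective_master_weight(cli_weight, connected_clients, default_weight=100):
    """Return the weight the master should use for job selection."""
    if cli_weight is not None:
        return cli_weight  # an explicit --weight always wins
    # No explicit weight: step aside once any client is connected.
    return 0 if connected_clients > 0 else default_weight


parser = argparse.ArgumentParser()
# default=None distinguishes "flag omitted" from any explicit value, including 0.
parser.add_argument("--weight", type=int, default=None)

explicit = parser.parse_args(["--weight", "50"])
implicit = parser.parse_args([])
```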


Remove automatic weight adjustment when clients connect · de670441

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Cluster master now participates in job selection even when clients are connected
- Local workers compete with external workers based on weight and VRAM


Fix cluster master to look for queued jobs instead of processing · 89c63cf8
Stefy Lanza (nextime / spora ) authored Oct 08, 2025
```
- Jobs are inserted as 'queued', not 'processing'
- Cluster master now finds and assigns queued jobs
```
Add debug output for detected local GPU VRAM · acc3a58a
Stefy Lanza (nextime / spora ) authored Oct 08, 2025
```
- Shows total VRAM detected on local GPUs when registering processes
```

Remove special VRAM allowance for local workers · 826da5da

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Local workers now require sufficient VRAM like other workers
- Since the server has 24 GB of VRAM and jobs need 16 GB, the check passes normally


Add job result monitoring for local job assignment · 837264f3
Stefy Lanza (nextime / spora ) authored Oct 08, 2025
```
- Local jobs now monitor for completion and handle results
- Prevents jobs from hanging without result retrieval
```

Fix local client registration to prevent disconnection · 9b864708

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Changed local client ID to 'local' and marked it as local to prevent cleanup
- Local clients are not cleaned up after 60 seconds
- Prevents 'Client local disconnected' messages
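
A sketch of a timeout sweep that exempts the local client. The 60-second timeout comes from the commit message; the `clients` dict layout and the `is_local` and `last_seen` keys are assumptions.

```python
import time

CLIENT_TIMEOUT = 60.0  # seconds, as stated in the commit message


def cleanup_stale_clients(clients, now=None):
    """Remove clients not seen within CLIENT_TIMEOUT, except local ones."""
    now = time.time() if now is None else now
    removed = []
    for client_id, info in list(clients.items()):
        if info.get("is_local"):
            continue  # local clients are exempt from timeout cleanup
        if now - info["last_seen"] > CLIENT_TIMEOUT:
            del clients[client_id]
            removed.append(client_id)
    return removed


clients = {
    "local": {"last_seen": 0.0, "is_local": True},
    "remote-a": {"last_seen": 0.0},
    "remote-b": {"last_seen": 950.0},
}
removed = cleanup_stale_clients(clients, now=1000.0)
```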


Fix cluster master to run jobs locally when no clients connected · 50877d40

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Register local processes in cluster master when weight > 0
- Handle local job assignment by forwarding to backend via TCP
- Allows jobs to run locally when no cluster clients are connected


Fix cluster client communication architecture · 6b673482

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Modified cluster client to connect to backend's TCP web port instead of worker Unix socket
- Backend acts as proper bridge: web interface (TCP) ↔ workers (Unix socket)
- Cluster client now communicates with backend the same way as web interface
- This fixes the timeout issue and ensures proper job flow through the backend


Add clean queue functionality to admin dashboard · 7986d63b

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Added clean_queue API endpoint in web.py for admin users
- Added clean_queue database function to delete all queued/processing jobs
- Added Clean Queue button to admin dashboard template
- Button is only visible to admin users and allows clearing stuck jobs
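
The database side of this could look roughly like the following. The table name, column names, and use of SQLite are assumptions; only the function name `clean_queue` and the two statuses come from the commit message.

```python
import sqlite3


def clean_queue(conn):
    """Delete every job still queued or processing; return how many were removed."""
    cur = conn.execute("DELETE FROM queue WHERE status IN ('queued', 'processing')")
    conn.commit()
    return cur.rowcount


# In-memory table used only to exercise the sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queue (job_id TEXT PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO queue VALUES (?, ?)",
    [("a", "queued"), ("b", "processing"), ("c", "completed"), ("d", "failed")],
)
deleted = clean_queue(conn)
remaining = [row[0] for row in conn.execute("SELECT job_id FROM queue ORDER BY job_id")]
```

Finished jobs are left untouched, so job history survives a queue clean.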


Fix duplicate backend startup in cluster client · b2e953fa

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Removed backend process startup from cluster_client.py since vidai.py already starts it for client mode
- This prevents 'Address already in use' error when running as cluster client
- Cluster client now only manages worker processes, not the backend


Fix cluster client communication by starting local backend process · 16b95c28

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Modified cluster client to start a local backend process alongside workers
- Backend process handles communication between cluster client and local workers
- Fixed process cleanup to properly terminate backend and worker processes
- This resolves the timeout issue when cluster client forwards jobs to local backend


Fix job re-queuing logic to prevent fallback to local processing · e28db173

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Modified queue.py to allow retried jobs to use distributed processing when available
- Fixed async coroutine warning by adding await to _transfer_job_files call
- Jobs that fail on clients will now be properly re-queued for distributed processing instead of falling back to local workers that may not exist


Fix client disconnection handling in cluster master · d5d30329

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Made assign_job_to_worker, _transfer_job_files, _transfer_file_via_websocket, enable_process, disable_process, update_process_weight, restart_client_workers, and restart_client_worker async methods
- Added proper exception handling for websocket send operations
- When websocket send fails due to broken connection, clients are now properly removed from available workers selection
- This ensures that disconnected clients are immediately removed from the worker pool and jobs are re-assigned to available workers
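
The pattern described here, sketched with a fake websocket. `FakeWebSocket`, `assign_job`, and the use of `ConnectionError` are illustrative assumptions; a real websocket library raises its own exception types, and the project's method signatures are not shown in this log.

```python
import asyncio


class FakeWebSocket:
    """Test double standing in for a real websocket connection."""

    def __init__(self, broken=False):
        self.broken = broken
        self.sent = []

    async def send(self, payload):
        if self.broken:
            raise ConnectionError("connection closed")
        self.sent.append(payload)


async def assign_job(clients, payload):
    """Try clients in order; drop any whose send fails and move on."""
    for client_id in list(clients):
        try:
            await clients[client_id].send(payload)
            return client_id
        except ConnectionError:
            # Broken connection: remove the client from the pool so it is
            # not selected again, then try the next candidate.
            del clients[client_id]
    return None


clients = {"dead": FakeWebSocket(broken=True), "alive": FakeWebSocket()}
assigned = asyncio.run(assign_job(clients, '{"type": "job"}'))
```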


Fix cluster client Message object and job re-assignment issues · a2c308f1

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Fixed cluster_client.py to send proper Message objects instead of dicts to backend_comm.send_message()
- Modified queue.py to prevent failed jobs from being immediately re-assigned to distributed processing
- Jobs with retry_count > 0 now use local processing to avoid loops with failing distributed workers


Rate limit queue manager console messages · e449b4a6

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Added last_status_print timestamp to QueueManager class
- Modified _process_queue to only print job status messages once every 10 seconds
- This prevents console spam from the queue manager when jobs are waiting for workers


Fix rate limiting logic for console messages · 93413b21

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Calculate should_print_status once per loop iteration instead of updating timestamp inside the loop
- This ensures consistent rate limiting where all job status messages are either printed together or not at all
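
A sketch of the corrected logic: the decision is made once per pass and the timestamp is updated after the loop. `StatusPrinter` is a hypothetical helper and returns the lines instead of printing them so the behaviour is easy to check; only the `last_status_print` and `should_print_status` names and the 10-second interval come from the commits.

```python
import time


class StatusPrinter:
    """Hypothetical helper mirroring the last_status_print idea."""

    def __init__(self, interval=10.0):
        self.interval = interval
        self.last_status_print = float("-inf")  # so the first pass always prints

    def process(self, waiting_jobs, now=None):
        now = time.monotonic() if now is None else now
        # Decide once per pass, so every job in this pass is treated alike.
        should_print_status = now - self.last_status_print >= self.interval
        lines = []
        for job_id in waiting_jobs:
            if should_print_status:
                lines.append(f"Job {job_id} waiting for available workers")
        if should_print_status:
            self.last_status_print = now  # updated after the loop, not inside it
        return lines


printer = StatusPrinter()
first = printer.process(["a", "b"], now=100.0)
second = printer.process(["a", "b"], now=105.0)
third = printer.process(["a", "b"], now=110.0)
```

Updating the timestamp inside the loop would let the first job print and silence the rest, which matches the inconsistency this commit describes.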


Rate limit console messages in cluster master · 958814d8

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Added last_job_status_print timestamp to ClusterMaster class
- Modified _management_loop to only print job status messages once every 10 seconds
- This prevents console spam when jobs are waiting for workers


Implement client failure tracking and exclusion · d98510b6

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Added consecutive_failures and failing flags to client tracking
- Increment failure counter on job failures, reset on success
- Mark clients as failing after 3 consecutive failures
- Exclude failing clients from worker selection in all methods
- Reset failure tracking when clients reconnect
- This prevents problematic clients from receiving jobs until they reconnect
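
The tracking rules above can be sketched as a small class. `ClientHealth` and its method names are hypothetical; the `consecutive_failures` and `failing` fields and the threshold of 3 come from the commit message.

```python
MAX_CONSECUTIVE_FAILURES = 3  # threshold stated in the commit message


class ClientHealth:
    """Hypothetical tracker for per-client failure state."""

    def __init__(self):
        self.clients = {}

    def register(self, client_id):
        # Connecting or reconnecting resets failure tracking.
        self.clients[client_id] = {"consecutive_failures": 0, "failing": False}

    def record_result(self, client_id, success):
        info = self.clients[client_id]
        if success:
            info["consecutive_failures"] = 0
        else:
            info["consecutive_failures"] += 1
            if info["consecutive_failures"] >= MAX_CONSECUTIVE_FAILURES:
                info["failing"] = True

    def eligible(self):
        """Clients that may still be selected for jobs."""
        return [cid for cid, info in self.clients.items() if not info["failing"]]


health = ClientHealth()
health.register("a")
health.register("b")
for _ in range(3):
    health.record_result("a", success=False)
health.record_result("b", success=False)
health.record_result("b", success=True)  # resets b's counter
excluded = health.eligible()
health.register("a")  # reconnect clears the failing flag
after_reconnect = health.eligible()
```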


Fix NameError: name 'os' is not defined in cluster_client.py · 20795a80

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Added missing import os at the top of vidai/cluster_client.py
- Removed redundant local import os in _handle_model_transfer_complete function
- This fixes the error raised on a client when it handles master commands after receiving a job


Instantiate queue manager with backend startup · a9a5da9c
Stefy Lanza (nextime / spora ) authored Oct 08, 2025

Add initialization log to QueueManager · 98e514f7
Stefy Lanza (nextime / spora ) authored Oct 08, 2025

Add debug logging to queue manager · 24bf0c4b
Stefy Lanza (nextime / spora ) authored Oct 08, 2025

Add logging when jobs are waiting for workers · be00cb97
Stefy Lanza (nextime / spora ) authored Oct 08, 2025

Re-queue jobs when no workers are available instead of failing · 109e1dc9
Stefy Lanza (nextime / spora ) authored Oct 08, 2025

Add logging for job assignment in cluster master · 5f82047c
Stefy Lanza (nextime / spora ) authored Oct 08, 2025

Fix GPU detection in cluster master to use correct keys from gpu_info · df7dfa12
Stefy Lanza (nextime / spora ) authored Oct 08, 2025

Simplify distributed worker check to use connected clients · f065a2ae
Stefy Lanza (nextime / spora ) authored Oct 08, 2025

Modify cluster master to poll database for jobs to assign · cc35fff0
Stefy Lanza (nextime / spora ) authored Oct 08, 2025

Store cluster processes in database for queue to check available workers · ee3095c3
Stefy Lanza (nextime / spora ) authored Oct 08, 2025

Integrate cluster master into backend process so queue can access worker state · fd1c1545
Stefy Lanza (nextime / spora ) authored Oct 08, 2025

Add delete button for failed jobs in history · 1d5a0b69
Stefy Lanza (nextime / spora ) authored Oct 08, 2025

Fix import error for get_result and modify queue to keep jobs queued when no workers available · 2f2c962e
Stefy Lanza (nextime / spora ) authored Oct 08, 2025


Add delete button for completed jobs in history · 55528540

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Added delete button next to 'View Result' for completed jobs in history.html
- Button appears only for completed jobs and includes confirmation dialog
- Uses existing /job/{job_id}/delete route which already handles ownership checks
- Maintains consistent styling with other action buttons

Users can now clean up their completed job history by deleting individual jobs they no longer need.


Fix distributed worker selection and local job processing · 93a6daac

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Enhanced cluster_master.select_worker_for_job() with more robust GPU detection:
  - Added flexible GPU info parsing with fallbacks
  - Support for incomplete GPU info structures
  - Allow CPU workers as fallback when no GPU workers available
  - Added detailed debug logging for troubleshooting worker selection
- Fixed queue._execute_local_job() to properly poll for backend results:
  - Changed from simulated processing to actual result polling
  - Added timeout handling (10 minutes max)
  - Proper error handling for failed jobs
- Simplified backend.handle_web_message() to use local worker routing:
  - Removed async cluster master calls that were failing
  - Use direct worker socket communication for local processing
- These changes should resolve the 'No suitable distributed worker' issue and make local processing work properly

The system now properly detects GPU workers, falls back to CPU workers if needed, and correctly processes jobs locally when distributed workers aren't available.
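
A rough sketch of tolerant GPU parsing with a CPU fallback. Every key name in `gpu_info` (`cuda_devices`, `devices`, `vram_gb`, `memory_gb`, `total_vram_gb`) and both function names are assumptions; the real structure used by `select_worker_for_job()` is not shown in this log. Here a CPU worker is chosen whenever no GPU worker meets the VRAM requirement, which is one possible reading of the fallback rule.

```python
def total_vram_gb(gpu_info):
    """Best-effort VRAM total from a possibly incomplete gpu_info dict."""
    if not isinstance(gpu_info, dict):
        return 0.0
    total = 0.0
    # All key names below are assumptions about possible report shapes.
    for device in gpu_info.get("cuda_devices") or gpu_info.get("devices") or []:
        if isinstance(device, dict):
            total += float(device.get("vram_gb") or device.get("memory_gb") or 0)
    if total == 0.0:
        total = float(gpu_info.get("total_vram_gb") or 0)
    return total


def select_worker(workers, required_vram_gb):
    """Prefer the GPU worker with the most VRAM; fall back to a CPU worker."""
    rated = [(total_vram_gb(w.get("gpu_info")), name) for name, w in workers.items()]
    capable = [item for item in rated if item[0] >= required_vram_gb]
    if capable:
        return max(capable)[1]
    cpu_only = [name for vram, name in rated if vram == 0.0]
    return sorted(cpu_only)[0] if cpu_only else None


workers = {
    "gpu-big": {"gpu_info": {"cuda_devices": [{"vram_gb": 24}]}},
    "gpu-small": {"gpu_info": {"total_vram_gb": 8}},
    "cpu-only": {"gpu_info": None},
    "no-info": {},
}
```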


Fix job scheduling and JavaScript errors · 90609675

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Fixed JavaScript error in analyze.html ('data.result.substring is not a function'): data.result is now checked to be a string before substring is called, and objects are converted to a JSON string if needed
- Added debug logging to cluster_master.select_worker_for_job() to diagnose why no distributed workers are found when GPU clients are connected
- Debug logs show available processes and process queue to help identify registration issues

This should resolve the JavaScript console error and help debug why cluster workers aren't being selected for jobs.
