Commits · 599fa8f370106e600f6a6c1b1e2df9243bbfae66 · SexHackMe / vidai

08 Oct, 2025 40 commits

Add restart functionality for cancelled jobs · 599fa8f3

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Add restart button in job history for cancelled jobs
- Add /job/<id>/restart route in web interface
- Add restart_job method in QueueManager to reset cancelled jobs to queued
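
As a rough illustration of the restart behaviour listed above, the sketch below resets a cancelled job to queued using an in-memory SQLite table. The `jobs` table, its columns, and the standalone `restart_job` signature are assumptions made for illustration; they are not the project's actual schema or the real QueueManager method.

```python
import sqlite3


def restart_job(conn: sqlite3.Connection, job_id: int) -> bool:
    """Reset a cancelled job to 'queued'; return True if a row changed."""
    cur = conn.execute(
        # Only cancelled jobs are eligible, so other states stay untouched.
        "UPDATE jobs SET status = 'queued' WHERE id = ? AND status = 'cancelled'",
        (job_id,),
    )
    conn.commit()
    return cur.rowcount == 1


# Hypothetical schema and sample rows for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)", [(1, "cancelled"), (2, "completed")]
)
```

Guarding the update with `status = 'cancelled'` means a stray restart request for a completed or running job is a no-op rather than a state corruption.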

599fa8f3

Add debug prints to worker and revert monitoring to TCP · 82f5cbfe

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Worker now prints when receiving jobs and sending results
- Cluster master uses TCP polling for consistency with clients

82f5cbfe

Change local job monitoring to poll database instead of TCP · 09a3589a

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Use get_queue_by_job_id to check job status
- More reliable than TCP polling for local jobs

09a3589a

Add missing pending_jobs attribute to ClusterMaster · 08a9a99e

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Initialize self.pending_jobs dict for job monitoring tasks
- Fixes AttributeError when assigning local jobs

08a9a99e

Fix message type for local job assignment · 42837adf

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Use 'analyze_request' instead of 'analysis_request'
- Match the expected message type in worker processes

42837adf

Fix Message object attribute access in job monitoring · fb38da16

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Change response.get('msg_type') to response.msg_type
- Change response.get('data') to response.data
- Message objects have no get method; use attribute access instead
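
The difference between dict-style and attribute access can be sketched as follows. The `Message` dataclass here is a stand-in written for illustration (only the field names `msg_type` and `data` come from the commit), and the `'analyze_response'` type string and payload are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Message:
    """Illustrative stand-in for the project's Message class."""
    msg_type: str
    data: dict[str, Any] = field(default_factory=dict)


# Hypothetical response object, as job monitoring might receive it.
response = Message("analyze_response", {"result": "ok"})

# Attribute access works on a plain object...
msg_type = response.msg_type
payload = response.data

# ...whereas response.get('msg_type') would raise AttributeError,
# because the object does not define a get method.
has_get = hasattr(response, "get")
```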

fb38da16

Restore automatic weight adjustment only when weight is not explicitly set · 384c96f5

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- If --weight is not specified, master weight changes to 0 when clients connect
- If --weight is specified, master participates in job selection with that weight
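
The two rules above can be read as a small decision function. This is a sketch under assumptions: the function name, the `default_weight` value of 100, and the use of `None` to mean "--weight not given" are all hypothetical and not taken from the codebase.

```python
from typing import Optional


def effective_master_weight(
    cli_weight: Optional[int],
    connected_clients: int,
    default_weight: int = 100,  # assumed default, for illustration only
) -> int:
    """Return the weight the master should use for job selection."""
    if cli_weight is not None:
        # --weight was given explicitly: always honour it.
        return cli_weight
    # --weight not given: step aside once any client is connected.
    return 0 if connected_clients > 0 else default_weight
```

Under these assumptions, `effective_master_weight(None, 2)` yields 0, while `effective_master_weight(50, 2)` keeps the explicit 50.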

384c96f5

Remove automatic weight adjustment when clients connect · de670441

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Cluster master now participates in job selection even when clients are connected
- Local workers compete with external workers based on weight and VRAM

de670441

Fix cluster master to look for queued jobs instead of processing · 89c63cf8

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Jobs are inserted as 'queued', not 'processing'
- Cluster master now finds and assigns queued jobs

89c63cf8

Add debug output for detected local GPU VRAM · acc3a58a

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Shows total VRAM detected on local GPUs when registering processes

acc3a58a

Remove special VRAM allowance for local workers · 826da5da

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Local workers now require sufficient VRAM like other workers
- Since the server has 24 GB of VRAM and jobs need 16 GB, the check passes normally
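
A uniform VRAM check for local and remote workers might look like the sketch below. The `Worker` dataclass, its field names, and the rule that the highest weight wins among eligible workers are assumptions for illustration; the commit only states that local workers no longer bypass the VRAM requirement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Worker:
    name: str
    vram_gb: float
    weight: int
    is_local: bool = False


def select_worker(
    workers: list[Worker], required_vram_gb: float
) -> Optional[Worker]:
    """Pick an eligible worker; local workers get no special allowance."""
    eligible = [
        w for w in workers
        if w.weight > 0 and w.vram_gb >= required_vram_gb
    ]
    # Assumed tie-break: prefer the highest weight among eligible workers.
    return max(eligible, key=lambda w: w.weight, default=None)


# Figures from the commit message: 24 GB on the server, 16 GB per job.
workers = [
    Worker("local", vram_gb=24, weight=100, is_local=True),
    Worker("small-client", vram_gb=8, weight=200),
]
chosen = select_worker(workers, required_vram_gb=16)
```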

826da5da

Add job result monitoring for local job assignment · 837264f3

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Local jobs now monitor for completion and handle results
- Prevents jobs from hanging without result retrieval

837264f3

Fix local client registration to prevent disconnection · 9b864708

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Changed local client ID to 'local' and marked as local to prevent cleanup
- Local clients are not cleaned up after 60 seconds
- Prevents 'Client local disconnected' messages
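
One way to express the cleanup exemption is to skip entries flagged as local when looking for stale clients. The dictionary layout, the `local` and `last_seen` keys, and the function name are hypothetical; only the 'local' client ID and the 60-second timeout come from the commit message.

```python
def stale_clients(clients: dict, now: float, timeout: float = 60.0) -> list:
    """Return IDs of clients that should be cleaned up as disconnected."""
    return [
        client_id
        for client_id, info in clients.items()
        # Local clients never expire, however old their last heartbeat is.
        if not info.get("local", False) and now - info["last_seen"] > timeout
    ]


# Hypothetical registry: both entries were last seen 120 seconds ago.
clients = {
    "local": {"last_seen": 0.0, "local": True},
    "remote-1": {"last_seen": 0.0},
}
expired = stale_clients(clients, now=120.0)
```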

9b864708

Fix cluster master to run jobs locally when no clients connected · 50877d40

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Register local processes in cluster master when weight > 0
- Handle local job assignment by forwarding to backend via TCP
- Allows jobs to run locally when no cluster clients are connected

50877d40

Fix cluster client communication architecture · 6b673482

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Modified cluster client to connect to backend's TCP web port instead of worker Unix socket
- Backend acts as proper bridge: web interface (TCP) ↔ workers (Unix socket)
- Cluster client now communicates with backend the same way as web interface
- This fixes the timeout issue and ensures proper job flow through the backend

6b673482

Add clean queue functionality to admin dashboard · 7986d63b

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Added clean_queue API endpoint in web.py for admin users
- Added clean_queue database function to delete all queued/processing jobs
- Added Clean Queue button to admin dashboard template
- Button is only visible to admin users and allows clearing stuck jobs
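
The database side of this feature could be sketched as a single delete over the two active states. The `jobs` table and its columns are assumed for illustration, and the admin-only check is left to the web layer, as the commit describes.

```python
import sqlite3


def clean_queue(conn: sqlite3.Connection) -> int:
    """Delete all queued or processing jobs; return how many were removed."""
    cur = conn.execute(
        "DELETE FROM jobs WHERE status IN ('queued', 'processing')"
    )
    conn.commit()
    return cur.rowcount


# Hypothetical schema and sample rows for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?)",
    [(1, "queued"), (2, "processing"), (3, "completed")],
)
removed = clean_queue(conn)
remaining = conn.execute("SELECT id, status FROM jobs").fetchall()
```

Completed jobs are left in place, so job history survives a queue cleanup.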

7986d63b

Fix duplicate backend startup in cluster client · b2e953fa

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Removed backend process startup from cluster_client.py since vidai.py already starts it for client mode
- This prevents 'Address already in use' error when running as cluster client
- Cluster client now only manages worker processes, not the backend

b2e953fa

Fix cluster client communication by starting local backend process · 16b95c28

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Modified cluster client to start a local backend process alongside workers
- Backend process handles communication between cluster client and local workers
- Fixed process cleanup to properly terminate backend and worker processes
- This resolves the timeout issue when cluster client forwards jobs to local backend

16b95c28

Fix job re-queuing logic to prevent fallback to local processing · e28db173

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Modified queue.py to allow retried jobs to use distributed processing when available
- Fixed async coroutine warning by adding await to _transfer_job_files call
- Jobs that fail on clients are now re-queued for distributed processing instead of falling back to local workers that may not exist

e28db173

Fix client disconnection handling in cluster master · d5d30329

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Made assign_job_to_worker, _transfer_job_files, _transfer_file_via_websocket, enable_process, disable_process, update_process_weight, restart_client_workers, and restart_client_worker async methods
- Added proper exception handling for websocket send operations
- When websocket send fails due to broken connection, clients are now properly removed from available workers selection
- This ensures that disconnected clients are immediately removed from the worker pool and jobs are re-assigned to available workers
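
The send-failure handling described above can be sketched with asyncio and stand-in socket classes. Everything here is illustrative: `send_or_drop`, `GoodSocket`, and `BrokenSocket` are hypothetical names, and a real websocket library may raise its own connection-closed exception type rather than `ConnectionError`.

```python
import asyncio


class GoodSocket:
    """Stand-in for a healthy websocket connection."""
    def __init__(self) -> None:
        self.sent: list[str] = []

    async def send(self, payload: str) -> None:
        self.sent.append(payload)


class BrokenSocket:
    """Stand-in for a connection that has already gone away."""
    async def send(self, payload: str) -> None:
        raise ConnectionError("connection closed")


async def send_or_drop(clients: dict, client_id: str, payload: str) -> bool:
    """Send to a client; on failure, remove it from the worker pool."""
    ws = clients.get(client_id)
    if ws is None:
        return False
    try:
        await ws.send(payload)
        return True
    except (ConnectionError, OSError):
        # A failed send is treated as a disconnect, so the caller can
        # re-assign the job to another worker.
        clients.pop(client_id, None)
        return False


async def demo() -> tuple:
    clients = {"a": GoodSocket(), "b": BrokenSocket()}
    ok_a = await send_or_drop(clients, "a", "job-1")
    ok_b = await send_or_drop(clients, "b", "job-1")
    return ok_a, ok_b, sorted(clients)


result = asyncio.run(demo())
```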

d5d30329

Fix cluster client Message object and job re-assignment issues · a2c308f1

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Fixed cluster_client.py to send proper Message objects instead of dicts to backend_comm.send_message()
- Modified queue.py to prevent failed jobs from being immediately re-assigned to distributed processing
- Jobs with retry_count > 0 now use local processing to avoid loops with failing distributed workers

a2c308f1

Rate limit queue manager console messages · e449b4a6

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Added last_status_print timestamp to QueueManager class
- Modified _process_queue to only print job status messages once every 10 seconds
- This prevents console spam from the queue manager when jobs are waiting for workers

e449b4a6

Fix rate limiting logic for console messages · 93413b21

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Calculate should_print_status once per loop iteration instead of updating timestamp inside the loop
- This ensures consistent rate limiting where all job status messages are either printed together or not at all
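
A minimal sketch of this fix, assuming a simplified loop: the decision to print is made once before iterating over jobs, and the timestamp is updated once afterwards. The `StatusPrinter` class and the message text are hypothetical; `last_status_print`, `should_print_status`, and the 10-second interval follow the commit messages.

```python
import time
from typing import Optional


class StatusPrinter:
    def __init__(self, interval: float = 10.0) -> None:
        self.interval = interval
        # -inf guarantees the very first pass is allowed to print.
        self.last_status_print = float("-inf")

    def process(self, jobs: list, now: Optional[float] = None) -> list:
        """Return the status lines that would be printed on this pass."""
        now = time.monotonic() if now is None else now
        # Decide once per iteration, not once per job.
        should_print_status = now - self.last_status_print >= self.interval
        printed = []
        for job in jobs:
            if should_print_status:
                printed.append(f"job {job} waiting for workers")
        if should_print_status:
            # Update the timestamp only after the whole batch is handled.
            self.last_status_print = now
        return printed
```

Updating the timestamp inside the loop would let the first job print and silence the rest; computing the flag up front keeps each batch all-or-nothing.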

93413b21

Rate limit console messages in cluster master · 958814d8

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Added last_job_status_print timestamp to ClusterMaster class
- Modified _management_loop to only print job status messages once every 10 seconds
- This prevents console spam when jobs are waiting for workers

958814d8

Implement client failure tracking and exclusion · d98510b6

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Added consecutive_failures and failing flags to client tracking
- Increment failure counter on job failures, reset on success
- Mark clients as failing after 3 consecutive failures
- Exclude failing clients from worker selection in all methods
- Reset failure tracking when clients reconnect
- This prevents problematic clients from receiving jobs until they reconnect
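
The tracking rules above could be sketched as a small state object per client. The `ClientState` class, method names, and `selectable` helper are hypothetical; the field names, the threshold of 3, and the reset-on-reconnect behaviour follow the commit message. Keeping the `failing` flag set until reconnect, even after a later success, is an assumption based on the last bullet.

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 3  # consecutive failures before exclusion


@dataclass
class ClientState:
    consecutive_failures: int = 0
    failing: bool = False

    def record_result(self, success: bool) -> None:
        if success:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= FAILURE_THRESHOLD:
                self.failing = True

    def on_reconnect(self) -> None:
        # A reconnect clears both the counter and the exclusion flag.
        self.consecutive_failures = 0
        self.failing = False


def selectable(clients: dict) -> list:
    """Return IDs of clients that may still receive jobs."""
    return [cid for cid, state in clients.items() if not state.failing]
```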

d98510b6

Fix NameError: name 'os' is not defined in cluster_client.py · 20795a80

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Added missing import os at the top of vidai/cluster_client.py
- Removed redundant local import os in _handle_model_transfer_complete function
- This fixes the error raised when a client handles master commands on receiving a job

20795a80

Instantiate queue manager with backend startup · a9a5da9c

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

a9a5da9c

Add initialization log to QueueManager · 98e514f7

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

98e514f7

Add debug logging to queue manager · 24bf0c4b

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

24bf0c4b

Add logging when jobs are waiting for workers · be00cb97

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

be00cb97

Re-queue jobs when no workers are available instead of failing · 109e1dc9

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

109e1dc9

Add logging for job assignment in cluster master · 5f82047c

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

5f82047c

Fix GPU detection in cluster master to use correct keys from gpu_info · df7dfa12

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

df7dfa12

Simplify distributed worker check to use connected clients · f065a2ae

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

f065a2ae

Modify cluster master to poll database for jobs to assign · cc35fff0

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

cc35fff0

Store cluster processes in database for queue to check available workers · ee3095c3

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

ee3095c3

Integrate cluster master into backend process so queue can access worker state · fd1c1545

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

fd1c1545

Add delete button for failed jobs in history · 1d5a0b69

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

1d5a0b69

Fix import error for get_result and modify queue to keep jobs queued when no workers available · 2f2c962e

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

2f2c962e

Add delete button for completed jobs in history · 55528540

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Added delete button next to 'View Result' for completed jobs in history.html
- Button appears only for completed jobs and includes confirmation dialog
- Uses existing /job/{job_id}/delete route which already handles ownership checks
- Maintains consistent styling with other action buttons

Users can now clean up their completed job history by deleting individual jobs they no longer need.

55528540