Commits · ebcbfda5fc85cf43d4f6d3fdb4622bbb38f49bd7 · SexHackMe / vidai

08 Oct, 2025 40 commits

Add cleanup of old active jobs · ebcbfda5

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Remove active jobs older than 10 minutes to prevent accumulation
- Cleans up leftover jobs from previous runs or crashes
- Prevents duplicate job tracking issues

ebcbfda5
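
A minimal sketch of the age-based cleanup described above. It assumes active jobs live in a dict keyed by job id, each with a `started_at` timestamp; the names `prune_stale_jobs`, `started_at`, and `STALE_AFTER_SECONDS` are hypothetical, and only the 10-minute threshold comes from the commit message.

```python
import time

# 10 minutes, as stated in the commit message.
STALE_AFTER_SECONDS = 10 * 60


def prune_stale_jobs(active_jobs, now=None):
    """Delete entries older than the threshold and return their job ids.

    `active_jobs` is assumed to map job_id -> {"started_at": <epoch seconds>}.
    """
    now = time.time() if now is None else now
    stale = [
        job_id
        for job_id, info in active_jobs.items()
        if now - info["started_at"] > STALE_AFTER_SECONDS
    ]
    # Delete in a second pass so the dict is not mutated while iterating.
    for job_id in stale:
        del active_jobs[job_id]
    return stale
```

Passing `now` explicitly keeps the function testable without waiting for real time to elapse.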

Fix duplicate cancellation · 7534e5d0

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Management loop only completes cancelled jobs; it no longer sends duplicate cancel messages
- cancel_job sends the cancel command; the management loop cleans up
- Prevents duplicate cancellation logs

7534e5d0

Prevent double job assignment · 76ed5870

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Check if job is already assigned before assigning again
- Prevents multiple active jobs for the same queue entry
- Fixes duplicate cancellation attempts

76ed5870
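
The guard described above might look roughly like the following. This is a sketch under assumptions, not the project's code: the `active_jobs` layout, the `assign_job` name, and the job id format are all hypothetical.

```python
def assign_job(active_jobs, queue_id, worker_id):
    """Assign a queue entry to a worker unless it is already being tracked.

    Returns True when a new assignment was recorded, False when the queue
    entry already has an active job.
    """
    already_assigned = any(
        job["queue_id"] == queue_id for job in active_jobs.values()
    )
    if already_assigned:
        return False

    job_id = f"job-{queue_id}"  # hypothetical id scheme
    active_jobs[job_id] = {"queue_id": queue_id, "worker": worker_id}
    return True
```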

Fix worker cancellation checking · ca9a9669

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Pass queue_id to workers for proper cancellation detection
- Workers now check correct job id for cancellation status
- Workers effectively receive stop commands by polling the database

ca9a9669

Send cancel commands to workers · b2f2786e

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Workers receive stop processing commands when jobs are cancelled
- Ensures workers halt processing immediately on cancel
- Maintains proper cleanup of resources

b2f2786e

Fix job cancellation cleanup · 855d427a

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Workers are freed immediately when jobs are cancelled
- Clean up active jobs in cluster master when cancelling processing jobs
- Remove unnecessary cleanup from restart (handled by cancel)

855d427a

Add missing _cancel_job_processing method · c633ee4d

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Implements job cancellation by notifying workers
- Sends cancel messages to local backend or remote clients
- Cleans up cancelled job resources

c633ee4d

Fix queue_id parameter passing · a5920057

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Pass queue_id to _assign_local_job method
- Fix NameError when assigning local jobs

a5920057

Fix job cancellation cleanup · d28295fc

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Store queue_id in active_jobs tracking
- Properly detect cancelled jobs by checking queue status
- Clean up worker resources when jobs are cancelled
- Workers become available for new jobs after cancellation

d28295fc

Add job cancellation communication · 83825820

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Cluster master sends cancel_job messages to backend/client when jobs are cancelled
- Add _handle_cancel_job to process cancel confirmations from clients
- Workers can be notified faster to stop processing and free resources

83825820

Fix job restart flow · 15dd9ddc

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Restarted jobs set to 'queued' status
- Cluster master looks for 'queued' jobs, sets to 'processing', then assigns
- Proper job lifecycle: queued -> processing -> assigned -> completed/failed

15dd9ddc
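
The lifecycle listed above can be written as a small transition table. This is an illustrative sketch only: the `ALLOWED` table and `advance` helper are hypothetical, and cancellation and restart transitions mentioned in other commits are left out.

```python
# Allowed transitions, taken from the lifecycle in the commit message:
# queued -> processing -> assigned -> completed/failed
ALLOWED = {
    "queued": {"processing"},
    "processing": {"assigned"},
    "assigned": {"completed", "failed"},
    "completed": set(),
    "failed": set(),
}


def advance(current, new):
    """Return the new status, or raise ValueError for an illegal transition."""
    if new not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new
```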

Fix job restart to set status to processing · 21d81a1c

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Restarted jobs now set to 'processing' with empty job_id
- Cluster master will pick up restarted jobs for assignment

21d81a1c

Remove queue processing from QueueManager · 2071e268

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- QueueManager now only handles job submission and management
- Job processing is handled exclusively by cluster master
- Eliminates duplicate queue processing between web and cluster processes

2071e268

Add debug prints to worker connection and registration · c243e798

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Show when worker connects to backend
- Show when worker registers
- Help debug why jobs aren't being received

c243e798

Fix cluster master to look for processing jobs · a2e19c42

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Queue manager marks jobs as processing, cluster master assigns them
- Changed query back to look for processing jobs without job_id

a2e19c42

Unify job processing through cluster master · 668c482a

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Always use cluster master for job assignment, even for local jobs
- Remove separate local processing path
- Local processes treated same as remote, except for auto weight adjustment

668c482a

Fix job assignment for restarted jobs · 32ef8692

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Allow local jobs to start even if worker socket check fails
- Backend handles worker availability, queue manager should always allow local processing

32ef8692

Add restart functionality for cancelled jobs · 599fa8f3

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Add restart button in job history for cancelled jobs
- Add /job/<id>/restart route in web interface
- Add restart_job method in QueueManager to reset cancelled jobs to queued

599fa8f3
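
A rough sketch of what resetting a cancelled job to queued could look like. It assumes a SQLite table named `queue` with `id` and `status` columns; the schema, and the function taking a connection argument, are assumptions for illustration, not the project's actual `restart_job` signature.

```python
import sqlite3


def restart_job(conn, queue_id):
    """Reset a cancelled job to 'queued'; return True if a row was changed.

    The WHERE clause restricts the update to cancelled jobs, so jobs in any
    other state are left untouched.
    """
    cur = conn.execute(
        "UPDATE queue SET status = 'queued' WHERE id = ? AND status = 'cancelled'",
        (queue_id,),
    )
    conn.commit()
    return cur.rowcount == 1
```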

Add debug prints to worker and revert monitoring to TCP · 82f5cbfe

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Worker now prints when receiving jobs and sending results
- Cluster master uses TCP polling for consistency with clients

82f5cbfe

Change local job monitoring to poll database instead of TCP · 09a3589a

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Use get_queue_by_job_id to check job status
- More reliable than TCP polling for local jobs

09a3589a

Add missing pending_jobs attribute to ClusterMaster · 08a9a99e

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Initialize self.pending_jobs dict for job monitoring tasks
- Fixes AttributeError when assigning local jobs

08a9a99e

Fix message type for local job assignment · 42837adf

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Use 'analyze_request' instead of 'analysis_request'
- Match the expected message type in worker processes

42837adf

Fix Message object attribute access in job monitoring · fb38da16

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Change response.get('msg_type') to response.msg_type
- Change response.get('data') to response.data
- Message objects don't have get method, use attributes instead

fb38da16
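
The difference between dict-style and attribute-style access can be illustrated with a stand-in class. The `Message` dataclass below is a hypothetical stand-in with the two fields named in the commit; the project's real class may carry more, and the `"result"` message type is made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Stand-in for the project's Message type (fields assumed)."""
    msg_type: str
    data: dict = field(default_factory=dict)


response = Message("result", {"status": "ok"})

# A plain object has no dict-style .get(), so response.get('msg_type')
# would raise AttributeError.
has_get = hasattr(response, "get")

# Attribute access works:
msg_type = response.msg_type
payload = response.data
```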

Restore automatic weight adjustment only when weight is not explicitly set · 384c96f5

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- If --weight is not specified, master weight changes to 0 when clients connect
- If --weight is specified, master participates in job selection with that weight

384c96f5
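
One way to express the rule above is to let `--weight` default to `None`, so an explicit value can be told apart from an omitted one. This is a sketch: the `DEFAULT_WEIGHT` value and the helper names are assumptions, and only the `--weight` flag and the drop to 0 come from the commit message.

```python
import argparse

DEFAULT_WEIGHT = 100  # hypothetical default used when --weight is omitted


def build_parser():
    parser = argparse.ArgumentParser()
    # default=None lets us detect whether the flag was given at all.
    parser.add_argument("--weight", type=int, default=None)
    return parser


def effective_master_weight(cli_weight, connected_clients):
    """Return the weight the master should use for job selection."""
    if cli_weight is not None:
        # Explicit weight: the master always participates with that weight.
        return cli_weight
    # No explicit weight: step aside (weight 0) once any client is connected.
    return 0 if connected_clients > 0 else DEFAULT_WEIGHT
```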

Remove automatic weight adjustment when clients connect · de670441

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Cluster master now participates in job selection even when clients are connected
- Local workers compete with external workers based on weight and VRAM

de670441
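
The commit does not state the exact selection rule, so the following is only one plausible reading: filter out workers without enough VRAM, then prefer the highest weight. The worker dict layout, the `pick_worker` name, and the numbers in the usage note are illustrative assumptions.

```python
def pick_worker(workers, required_vram_gb):
    """Pick the highest-weight worker that has enough VRAM, or None.

    Local and remote workers are treated identically; weight 0 means the
    worker does not take part in selection.
    """
    eligible = [
        w for w in workers
        if w["weight"] > 0 and w["vram_gb"] >= required_vram_gb
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda w: w["weight"])
```

For example, with a local worker (weight 100, 24 GB), a remote worker with weight 200 but only 12 GB, and another remote worker with weight 50 and 48 GB, a 16 GB job would go to the local worker under this rule.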

Fix cluster master to look for queued jobs instead of processing · 89c63cf8

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Jobs are inserted as 'queued', not 'processing'
- Cluster master now finds and assigns queued jobs

89c63cf8

Add debug output for detected local GPU VRAM · acc3a58a

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Shows total VRAM detected on local GPUs when registering processes

acc3a58a

Remove special VRAM allowance for local workers · 826da5da

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Local workers now require sufficient VRAM like other workers
- Since the server has 24GB VRAM and jobs need 16GB, the check passes normally

826da5da

Add job result monitoring for local job assignment · 837264f3

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Local jobs now monitor for completion and handle results
- Prevents jobs from hanging without result retrieval

837264f3

Fix local client registration to prevent disconnection · 9b864708

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Changed local client ID to 'local' and marked as local to prevent cleanup
- Local clients are not cleaned up after 60 seconds
- Prevents 'Client local disconnected' messages

9b864708

Fix cluster master to run jobs locally when no clients connected · 50877d40

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Register local processes in cluster master when weight > 0
- Handle local job assignment by forwarding to backend via TCP
- Allows jobs to run locally when no cluster clients are connected

50877d40

Fix cluster client communication architecture · 6b673482

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Modified cluster client to connect to backend's TCP web port instead of worker Unix socket
- Backend acts as proper bridge: web interface (TCP) ↔ workers (Unix socket)
- Cluster client now communicates with backend the same way as web interface
- This fixes the timeout issue and ensures proper job flow through the backend

6b673482

Add clean queue functionality to admin dashboard · 7986d63b

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Added clean_queue API endpoint in web.py for admin users
- Added clean_queue database function to delete all queued/processing jobs
- Added Clean Queue button to admin dashboard template
- Button is only visible to admin users and allows clearing stuck jobs

7986d63b

Fix duplicate backend startup in cluster client · b2e953fa

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Removed backend process startup from cluster_client.py since vidai.py already starts it for client mode
- This prevents 'Address already in use' error when running as cluster client
- Cluster client now only manages worker processes, not the backend

b2e953fa

Fix cluster client communication by starting local backend process · 16b95c28

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Modified cluster client to start a local backend process alongside workers
- Backend process handles communication between cluster client and local workers
- Fixed process cleanup to properly terminate backend and worker processes
- This resolves the timeout issue when cluster client forwards jobs to local backend

16b95c28

Fix job re-queuing logic to prevent fallback to local processing · e28db173

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Modified queue.py to allow retried jobs to use distributed processing when available
- Fixed async coroutine warning by adding await to _transfer_job_files call
- Jobs that fail on clients will now be properly re-queued for distributed processing instead of falling back to local workers that may not exist

e28db173

Fix client disconnection handling in cluster master · d5d30329

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Made assign_job_to_worker, _transfer_job_files, _transfer_file_via_websocket, enable_process, disable_process, update_process_weight, restart_client_workers, and restart_client_worker async methods
- Added proper exception handling for websocket send operations
- When websocket send fails due to broken connection, clients are now properly removed from available workers selection
- This ensures that disconnected clients are immediately removed from the worker pool and jobs are re-assigned to available workers

d5d30329
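
The failure handling described above can be sketched with asyncio and stand-in sockets. Everything here is hypothetical: the socket classes, `send_or_drop`, and this simplified `assign_job` are illustrations, and the exception type a real websocket library raises on a closed connection may differ from the `OSError` family caught below.

```python
import asyncio


class BrokenSocket:
    """Stand-in for a websocket whose peer has gone away."""
    async def send(self, payload):
        raise ConnectionError("connection closed")


class GoodSocket:
    """Stand-in for a healthy websocket; records what was sent."""
    def __init__(self):
        self.sent = []

    async def send(self, payload):
        self.sent.append(payload)


async def send_or_drop(clients, client_id, payload):
    """Send payload; on a broken connection, remove the client from the pool."""
    try:
        await clients[client_id].send(payload)
        return True
    except OSError:  # ConnectionError is a subclass of OSError
        clients.pop(client_id, None)
        return False


async def assign_job(clients, payload):
    """Try clients in order; return the id of the first that accepts the job."""
    for client_id in list(clients):  # copy: the dict may shrink while looping
        if await send_or_drop(clients, client_id, payload):
            return client_id
    return None
```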

Fix cluster client Message object and job re-assignment issues · a2c308f1

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Fixed cluster_client.py to send proper Message objects instead of dicts to backend_comm.send_message()
- Modified queue.py to prevent failed jobs from being immediately re-assigned to distributed processing
- Jobs with retry_count > 0 now use local processing to avoid loops with failing distributed workers

a2c308f1

Rate limit queue manager console messages · e449b4a6

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Added last_status_print timestamp to QueueManager class
- Modified _process_queue to only print job status messages once every 10 seconds
- This prevents console spam from the queue manager when jobs are waiting for workers

e449b4a6

Fix rate limiting logic for console messages · 93413b21

Stefy Lanza (nextime / spora ) authored Oct 08, 2025

- Calculate should_print_status once per loop iteration instead of updating timestamp inside the loop
- This ensures consistent rate limiting where all job status messages are either printed together or not at all

93413b21
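
The rate-limiting fix can be sketched as follows. The `should_print_status` and `last_status_print` names and the 10-second interval come from the commit messages; the `StatusPrinter` class, the `report` method, and the message text are hypothetical.

```python
import time


class StatusPrinter:
    def __init__(self, interval=10.0):
        self.interval = interval
        self.last_status_print = 0.0

    def report(self, waiting_jobs, now=None):
        """Print one status line per waiting job, at most once per interval.

        Returns the list of job ids that were printed.
        """
        now = time.time() if now is None else now
        # Decide once per iteration, before looping over jobs. Updating the
        # timestamp inside the loop would let only the first job print.
        should_print_status = now - self.last_status_print >= self.interval

        printed = []
        for job_id in waiting_jobs:
            if should_print_status:
                print(f"Job {job_id} waiting for available workers")
                printed.append(job_id)

        if should_print_status:
            self.last_status_print = now
        return printed
```

Because the decision is taken once per pass, the messages for all waiting jobs are either printed together or skipped together.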