Fix distributed worker selection and local job processing
- Enhanced cluster_master.select_worker_for_job() with more robust GPU detection: - Added flexible GPU info parsing with fallbacks - Support for incomplete GPU info structures - Allow CPU workers as fallback when no GPU workers available - Added detailed debug logging for troubleshooting worker selection - Fixed queue._execute_local_job() to properly poll for backend results: - Changed from simulate processing to actual result polling - Added timeout handling (10 minutes max) - Proper error handling for failed jobs - Simplified backend.handle_web_message() to use local worker routing: - Removed async cluster master calls that were failing - Use direct worker socket communication for local processing - These changes should resolve the 'No suitable distributed worker' issue and make local processing work properly The system now properly detects GPU workers, falls back to CPU workers if needed, and correctly processes jobs locally when distributed workers aren't available.
Showing
Please
register
or
sign in
to comment