Removes debugging code from backfill-prompts
#1008
Merged
Conversation
* adding PuzzleMatcherVerifier * Update ground_truth_utils.py * Create test_ground_truth_utils.py * Update ground_truth_utils.py * Update ground_truth_utils.py * lint * lint, address gemini * Puzzle -> puzzle * Fixed tests. * Cleaned up test * Cleaned up test --------- Co-authored-by: Teng Xiao <[email protected]> Co-authored-by: Finbarr Timbers <[email protected]>
* fixes for manual eval jobs with new cluster names * fix default cluster names
* Disables cascade attention. * fix issue with creation of LLMEngine.
Force-pushed from 372b8f6 to 9e2921a
Fix KeyError in _poll_tool_futures by restoring missing concat_outputs assignment The removal of asserts accidentally removed the line that stores output_copy in tracking['concat_outputs'], causing a KeyError when _poll_tool_futures tries to access the tracking data for tool processing. 🤖 Generated with Claude Code Co-Authored-By: Claude <[email protected]>
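A minimal sketch of the pattern this commit restores, assuming a `tracking` dict shared between the request loop and `_poll_tool_futures`; only `output_copy` and `tracking['concat_outputs']` come from the commit message, the rest is illustrative:

```python
import copy

def record_sub_request_output(tracking: dict, request_id: str, request_output) -> None:
    # Keep an independent copy so later mutation of the live engine output
    # cannot corrupt what tool processing sees.
    output_copy = copy.deepcopy(request_output)
    # This is the assignment that was accidentally dropped along with the
    # asserts; without it the lookup below raises KeyError.
    tracking["concat_outputs"][request_id] = output_copy

def poll_tool_futures(tracking: dict, request_id: str):
    # Tool processing later reads the stored copy back out by request ID.
    return tracking["concat_outputs"][request_id]
```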
Fix linter errors: remove unused variables - Removed unused 'prefix' variable in add_request function - Removed unused 'request_id' variable in _prefetch_worker - Removed unused 'training_step' variable in _maybe_process_and_insert - Removed unused 'context' variable in _poll_tool_futures 🤖 Generated with Claude Code Co-Authored-By: Claude <[email protected]>
* Changes have been made. * Updated function signature of _finalize_outputs. * Added tests for make_request_id. * Deleted old function. * Cleaned up test * Removed spurious except clause * Fixed bug where we were passing the wrong sampling params to _finalize_outputs. * Fixed tests. * Ran linter.
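The request IDs used throughout this PR look like `train_3_12_1` (training step 3, dataset index 12, sample 1, per the examples quoted later in the timeline). A hedged sketch of what `make_request_id` and the related helpers might look like; only `make_request_id`, `_extract_base_request_id`, and the ID format appear in the source:

```python
def make_request_id(training_step: int, dataset_index: int, sample_index: int, prefix: str = "train") -> str:
    # e.g. make_request_id(3, 12, 1) -> "train_3_12_1"
    return f"{prefix}_{training_step}_{dataset_index}_{sample_index}"

def _extract_base_request_id(request_id: str) -> str:
    # Drop the trailing sample index: "train_3_12_1" -> "train_3_12"
    return request_id.rsplit("_", 1)[0]

def _extract_sample_index(request_id: str) -> int:
    # "train_3_12_1" -> 1; used to repair CompletionOutput.index when tools are disabled
    return int(request_id.rsplit("_", 1)[1])
```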
* check that we have enough data * lint * comments * update comment * minor update
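A minimal sketch of the kind of guard the "enough data for bsz/prefill" check describes, assuming the prompt count and per-step inference batch size are known; the function name is illustrative:

```python
def assert_enough_prompts(num_prompts: int, inference_batch_size: int) -> None:
    # Refuse to start if the dataset cannot fill even one inference batch;
    # failing early beats hanging later while the engine waits for prompts.
    if num_prompts < inference_batch_size:
        raise ValueError(
            f"Dataset has {num_prompts} prompts but the inference batch size is "
            f"{inference_batch_size}; reduce the batch size or add data."
        )
```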
Combine process_output and _process_completed_request into process_completed_request function.
github-merge-queue bot pushed a commit that referenced this pull request on Sep 16, 2025
…so that we are continually processing prompts. (#998) * PAsses single prompts through. * Updated queue_types.py to match. * Fixed issue with code. * Fixed queue sizing issue. * Updated length of tool use experiment. * Merged conflicts * Corrected inference batch size calculation. * Fixed merge errors. * Undid changes ot test file. * UNdid changes. * Cleaned up code * Another attempt to fix the dataset_index bug. * Another attempt to fix the dataset_index bug. * Added assert statements. * Removed debugging code. * Less changes * Undid changes * Cleaned up PR. * Fixed change in sort order * Clean up PR * Cleaned up code. * Cleaned up PR. * Fixed issue. * Fixed linter errors. * Updated tool_grpo_fast.sh to use new workspace. * Removed redundant test. * Added back whitespace. * Ran linter. * Refactored code. * Cleaned up PR. * Fixed linter error. * Removed logging. * Removed logging statement. * Attempt at fix mask mismatch issue. * Tests should pass now. * Updated timing code. * Ran linter. * Added timing. * Timing is fast now. * Remove timing instrumentation code - Remove detailed timing instrumentation from accumulate_inference_batches - Remove enhanced Timer class duration property and initialization timing - Remove process_from_queue timing breakdown logging - Preserve all functional changes including single-prompt processing 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Added lots of debugging statements. * Ran linter. Fixed bug. * Added test file * Removed whitespace * Updated script. * Cleaned up code. * Removed debugging code. * Fixed failing test. * Set timeout for tests. They should take 5 minutes to run. * Now, tests should pass. * now tests should pass. * Linter passes. * Now, stream prompts into and out of the engine. * now, tests should pass * now, tests should actually pass * now, tests should actually pass * Cleaned up the way we do streaming. * Ran linter. * Undid changes to test. * Refactored code. * Fixed bug. * Removed comment. * Cleaned up PR. * Removed second fill_engine call. * Fixed linter error. * updated code * Attempt at fix from gpt5-pro. * Updated timing code * Updated timing code * Tons of debugging * Lots of changes, debugging, seems to run but slow? * Remove debugging/logging code from actor_manager.py and grpo_fast.py - Cleaned up excessive logging in actor_manager setup functions - Removed debug timing code from load_data_from_packing_thread - Restored clean function signatures and error handling * Remove profiling and hanging operation debug instrumentation from vllm_utils3.py - Removed ProfilerContext class and cProfile imports - Removed warn_if_slow and warn_if_slow_hanging_operation decorators - Removed HANGING_OPERATION_WARNING_TIMEOUT_S constant - Cleaned up method calls wrapped with timing decorators * Cleaned up PR * Added back removed functionality. * Cleaned up PR. * Cleans up code and makes sure that we don't call fill_engine more than every 100 iterations * Revert "Cleans up code and makes sure that we don't call fill_engine more than every 100 iterations" This reverts commit 6d4d027. * Adds a prefetch worker * removed notify * Added test, fixed prefetch bug * Added verbose logging guard * Cleaned up implementation * Lots of help with prefetch thread * Now, we refill on every step, and block on calls to insert data. * Now, large_test_script.sh successfully completes. * Updated prefetch thread to enqueue requests automatically. 
* Added retries with backoff for queue operations * Set reraise=True. * Updated code to catch specific errors * Fixed syntaxerror. * now, tests pass * Removed robust queue operations * Updated code to clean it up * Updatd code * Removed uv.lock * Removed fill_engine. * Tried to fix error in tool use * A bunch of attempted fixes. * Updated code ot catch issue * Attempt to fix resolution issue. * made changes suggested by ChatGPT * made chatgpt suggestions * Claude suggested some changes to assertions. * Another attempt at fixing by adding comprehensive assertions * made a bunch of changes * Now, processing persists across calls to process_from_queue * Added local tool script * Updated code * Lots of changes * Fixed exit condition * Cleaned up code * Added a fix to make sure we don't process a request until all tool futures are complete. * Fixed bug where we were accidentally adding unfinished requests to request_outputs. * Updated script so we can't launch experiments with a dirty repo. * fixed bug * mdofied script * Attempt at making the flow correct * Added additional tracking. * Added more logging * Fixed bug that was incorrectly considering a request complete. * Checks for active vllm continuations * Fixed sub-requests * Added additional assertions to track bug down. * Updated code to add validation * Added more validation. * Fixed error in validation * Added more validation. * Fixed error in validation. * Added more validation. * Fixed validations. * updated validation. * Fixed a bug with processing. Added a function to handle finishing a request. * Moved where we call validation to be in a more appropriate place. * Fixed validation to be more comprehensive. * Fixed bug where we were passing a RequestOutput object rather than a list[CompletionOutput]. * Fixed else clause. * Fixed bug where we were modifying the tracking dict while iterating over it. * Fixed validation * Removed redundant safety check. * Added more logging. * Attempt at fixing issue * Moves delete * Fixed bug in tool code. * Ran linter. Tool script seems to run fine. * Fixed bug in non-tool use. * Updated code to fix index bug. * Undid host network change to tool_grpo_fast.sh. * Fixed no host networking command from tool_grpo_fast.sh. * Fixed incorrect index field. * Fixed index * Fix request tracking and training_step propagation for non-tool use - Use dataclasses.replace to properly set CompletionOutput.index when tools disabled - Make request_tracking keys unique by combining training_step and dataset_index - Pass training_step through to GenerationResult to enable proper tracking key construction This fixes the hang that occurred when tools were disabled due to: 1. Incorrect index values preventing output matching 2. Duplicate dataset indices across training steps causing prefetch thread crash * Fix KeyError by accessing training_step before cleanup The recent commit changed the tracking system to use combined keys (training_step_dataset_index), but there was a bug where we tried to access request_metadata[request_id]["training_step"] after calling _cleanup_request_data, which removes the metadata. This fix gets the training_step value before calling _cleanup_request_data. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Add assertion to verify all n sub-requests are tracked This assertion will help debug the hanging issue in single_gpu_on_beaker.sh by ensuring that all n sub-requests are properly added to vllm_active_requests. 
If the assertion fails, it means some sub-requests are not being tracked properly after being added to vLLM. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Fix index field timing issue in non-tool mode with n>1 Move index field assignment before adding CompletionOutput to request_outputs to ensure _maybe_process_and_insert can properly identify sub-requests. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Fix CompletionOutput index for non-tool mode with n>1 When tools are disabled and n>1, fix the index field immediately when receiving outputs from vLLM. Each sub-request gets the correct index extracted from its ID (e.g., train_3_12_1 -> index=1). Added assertions to verify the fix is working correctly. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Changed loop * Set verbose to true * Fixed tracking key bug and added a health check for prefetch worker to kill LLMRayActor. * changed removal from vllm_active_requests to occur after _finalize * Fix sub-request tracking race condition in non-tools mode The issue occurred when tools were disabled due to premature removal of sub-requests from vllm_active_requests. This caused validation failures where expected sub-requests couldn't be found. The fix ensures sub-requests are removed from vllm_active_requests at the correct time - after determining they won't go to tools but before calling _finalize_sub_request to avoid deadlock in _maybe_process_and_insert. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Removes debugging code from `backfill-prompts` (#1008) * remove flash infer from dependencies, mason, and benchmark script (#1000) * update eval names (#1001) * Puzzle verifier (#1003) * adding PuzzleMatcherVerifier * Update ground_truth_utils.py * Create test_ground_truth_utils.py * Update ground_truth_utils.py * Update ground_truth_utils.py * lint * lint, address gemini * Puzzle -> puzzle * Fixed tests. * Cleaned up test * Cleaned up test --------- Co-authored-by: Teng Xiao <[email protected]> Co-authored-by: Finbarr Timbers <[email protected]> * fixes for manual eval jobs with new cluster names (#1002) * fixes for manual eval jobs with new cluster names * fix default cluster names * Disables cascade attention. (#1006) * Disables cascade attention. * fix issue with creation of LLMEngine. * whoops wrong name (#1007) * Clean up * Removed request_tracking. * Removed tracking key call * Fixed attribute error * Removed tracking key * Removed asserts * Fix KeyError in _poll_tool_futures by restoring missing concat_outputs assignment The removal of asserts accidentally removed the line that stores output_copy in tracking['concat_outputs'], causing a KeyError when _poll_tool_futures tries to access the tracking data for tool processing. 🤖 Generated with Claude Code Co-Authored-By: Claude <[email protected]> * Fix linter errors: remove unused variables - Removed unused 'prefix' variable in add_request function - Removed unused 'request_id' variable in _prefetch_worker - Removed unused 'training_step' variable in _maybe_process_and_insert - Removed unused 'context' variable in _poll_tool_futures 🤖 Generated with Claude Code Co-Authored-By: Claude <[email protected]> * Undid local changes. * Fixed accidental deletion * Removed copy * Cleaned function signatures. * Cleaned up finalize_outputs * Added script. 
(#1009) * Changed vllm_active_requests into a set. * Lots of clean up. * Cleaned up _maybe_process_and_insert significantly. * Attempt at fixing empty response issue. * Fixed errors and testing with fake asserts. (#1010) * Small refactor to prepare for the `backfill-prompts` PR (#1011) * Changes have been made. * Updated function signature of _finalize_outputs. * Added tests for make_request_id. * Deleted old function. * Cleaned up test * Removed spurious except clause * Fixed bug where we were passing the wrong sampling params to _finalize_outputs. * Fixed tests. * Ran linter. * attempted to fix bug in completions. * Removed _finalize_outputs, renamed process_outputs. * Check we have enough data for bsz/prefill (#1012) * check that we have enough data * lint * comments * update comment * minor update * Cleaned up code significantly. * Cleaned up tests. * Combine process_output and _process_completed_request into process_completed_request function. * Removed duplicate definition of _extract_base_request_id. * Fixed linter error. --------- Co-authored-by: Saurabh Shah <[email protected]> Co-authored-by: Hamish Ivison <[email protected]> Co-authored-by: Teng Xiao <[email protected]> Co-authored-by: FaezeBr <[email protected]> Co-authored-by: Claude <[email protected]> * Fixed linter error. * Removed _has_Pending_engine_requests_for-base_id method. * Used a dictionary instead of a loop. --------- Co-authored-by: Claude <[email protected]> Co-authored-by: Saurabh Shah <[email protected]> Co-authored-by: Hamish Ivison <[email protected]> Co-authored-by: Teng Xiao <[email protected]> Co-authored-by: FaezeBr <[email protected]>
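Two of the fixes folded into this commit are easy to picture in isolation: repairing `CompletionOutput.index` with `dataclasses.replace` when tools are disabled, and keying request tracking by training step plus dataset index so indices can repeat across steps. A hedged sketch, assuming vLLM's `CompletionOutput` dataclass and the `train_{step}_{index}_{sample}` ID format; the helper names are illustrative:

```python
import dataclasses

from vllm import CompletionOutput  # dataclass vLLM returns for each sample


def fix_completion_index(output: CompletionOutput, sub_request_id: str) -> CompletionOutput:
    # Sub-request IDs end in the sample index ("train_3_12_1" -> 1). When tools
    # are disabled and n > 1, rebuild the output with that index so downstream
    # matching of sub-requests to their base request works.
    sample_index = int(sub_request_id.rsplit("_", 1)[1])
    return dataclasses.replace(output, index=sample_index)


def tracking_key(training_step: int, dataset_index: int) -> str:
    # Combining both values keeps keys unique across training steps, which is
    # what prevents the prefetch-thread crash caused by duplicate dataset indices.
    return f"{training_step}_{dataset_index}"
```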
github-merge-queue bot pushed a commit that referenced this pull request on Sep 17, 2025
* PAsses single prompts through. * Updated queue_types.py to match. * Fixed issue with code. * Fixed queue sizing issue. * Updated length of tool use experiment. * Merged conflicts * Corrected inference batch size calculation. * Fixed merge errors. * Undid changes ot test file. * UNdid changes. * Cleaned up code * Another attempt to fix the dataset_index bug. * Another attempt to fix the dataset_index bug. * Added assert statements. * Removed debugging code. * Less changes * Undid changes * Cleaned up PR. * Fixed change in sort order * Clean up PR * Cleaned up code. * Cleaned up PR. * Fixed issue. * Fixed linter errors. * Updated tool_grpo_fast.sh to use new workspace. * Removed redundant test. * Added back whitespace. * Ran linter. * Refactored code. * Cleaned up PR. * Fixed linter error. * Removed logging. * Removed logging statement. * Attempt at fix mask mismatch issue. * Tests should pass now. * Updated timing code. * Ran linter. * Added timing. * Timing is fast now. * Remove timing instrumentation code - Remove detailed timing instrumentation from accumulate_inference_batches - Remove enhanced Timer class duration property and initialization timing - Remove process_from_queue timing breakdown logging - Preserve all functional changes including single-prompt processing 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Added lots of debugging statements. * Ran linter. Fixed bug. * Added test file * Removed whitespace * Updated script. * Cleaned up code. * Removed debugging code. * Fixed failing test. * Set timeout for tests. They should take 5 minutes to run. * Now, tests should pass. * now tests should pass. * Linter passes. * Now, stream prompts into and out of the engine. * now, tests should pass * now, tests should actually pass * now, tests should actually pass * Cleaned up the way we do streaming. * Ran linter. * Undid changes to test. * Refactored code. * Fixed bug. * Removed comment. * Cleaned up PR. * Removed second fill_engine call. * Fixed linter error. * updated code * Attempt at fix from gpt5-pro. * Updated timing code * Updated timing code * Tons of debugging * Lots of changes, debugging, seems to run but slow? * Remove debugging/logging code from actor_manager.py and grpo_fast.py - Cleaned up excessive logging in actor_manager setup functions - Removed debug timing code from load_data_from_packing_thread - Restored clean function signatures and error handling * Remove profiling and hanging operation debug instrumentation from vllm_utils3.py - Removed ProfilerContext class and cProfile imports - Removed warn_if_slow and warn_if_slow_hanging_operation decorators - Removed HANGING_OPERATION_WARNING_TIMEOUT_S constant - Cleaned up method calls wrapped with timing decorators * Cleaned up PR * Added back removed functionality. * Cleaned up PR. * Cleans up code and makes sure that we don't call fill_engine more than every 100 iterations * Revert "Cleans up code and makes sure that we don't call fill_engine more than every 100 iterations" This reverts commit 6d4d027. * Adds a prefetch worker * removed notify * Added test, fixed prefetch bug * Added verbose logging guard * Cleaned up implementation * Lots of help with prefetch thread * Now, we refill on every step, and block on calls to insert data. * Now, large_test_script.sh successfully completes. * Updated prefetch thread to enqueue requests automatically. * Added retries with backoff for queue operations * Set reraise=True. 
* Updated code to catch specific errors * Fixed syntaxerror. * now, tests pass * Removed robust queue operations * Updated code to clean it up * Updatd code * Removed uv.lock * Removed fill_engine. * Tried to fix error in tool use * A bunch of attempted fixes. * Updated code ot catch issue * Attempt to fix resolution issue. * made changes suggested by ChatGPT * made chatgpt suggestions * Claude suggested some changes to assertions. * Another attempt at fixing by adding comprehensive assertions * made a bunch of changes * Now, processing persists across calls to process_from_queue * Added local tool script * Updated code * Lots of changes * Fixed exit condition * Cleaned up code * Added a fix to make sure we don't process a request until all tool futures are complete. * Fixed bug where we were accidentally adding unfinished requests to request_outputs. * Updated script so we can't launch experiments with a dirty repo. * fixed bug * mdofied script * Attempt at making the flow correct * Added additional tracking. * Added more logging * Fixed bug that was incorrectly considering a request complete. * Checks for active vllm continuations * Fixed sub-requests * Added additional assertions to track bug down. * Updated code to add validation * Added more validation. * Fixed error in validation * Added more validation. * Fixed error in validation. * Added more validation. * Fixed validations. * updated validation. * Fixed a bug with processing. Added a function to handle finishing a request. * Moved where we call validation to be in a more appropriate place. * Fixed validation to be more comprehensive. * Fixed bug where we were passing a RequestOutput object rather than a list[CompletionOutput]. * Fixed else clause. * Fixed bug where we were modifying the tracking dict while iterating over it. * Fixed validation * Removed redundant safety check. * Added more logging. * Attempt at fixing issue * Moves delete * Fixed bug in tool code. * Ran linter. Tool script seems to run fine. * Fixed bug in non-tool use. * Updated code to fix index bug. * Undid host network change to tool_grpo_fast.sh. * Fixed no host networking command from tool_grpo_fast.sh. * Fixed incorrect index field. * Fixed index * Fix request tracking and training_step propagation for non-tool use - Use dataclasses.replace to properly set CompletionOutput.index when tools disabled - Make request_tracking keys unique by combining training_step and dataset_index - Pass training_step through to GenerationResult to enable proper tracking key construction This fixes the hang that occurred when tools were disabled due to: 1. Incorrect index values preventing output matching 2. Duplicate dataset indices across training steps causing prefetch thread crash * Fix KeyError by accessing training_step before cleanup The recent commit changed the tracking system to use combined keys (training_step_dataset_index), but there was a bug where we tried to access request_metadata[request_id]["training_step"] after calling _cleanup_request_data, which removes the metadata. This fix gets the training_step value before calling _cleanup_request_data. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Add assertion to verify all n sub-requests are tracked This assertion will help debug the hanging issue in single_gpu_on_beaker.sh by ensuring that all n sub-requests are properly added to vllm_active_requests. If the assertion fails, it means some sub-requests are not being tracked properly after being added to vLLM. 
🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Fix index field timing issue in non-tool mode with n>1 Move index field assignment before adding CompletionOutput to request_outputs to ensure _maybe_process_and_insert can properly identify sub-requests. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Fix CompletionOutput index for non-tool mode with n>1 When tools are disabled and n>1, fix the index field immediately when receiving outputs from vLLM. Each sub-request gets the correct index extracted from its ID (e.g., train_3_12_1 -> index=1). Added assertions to verify the fix is working correctly. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Changed loop * Set verbose to true * Fixed tracking key bug and added a health check for prefetch worker to kill LLMRayActor. * changed removal from vllm_active_requests to occur after _finalize * Fix sub-request tracking race condition in non-tools mode The issue occurred when tools were disabled due to premature removal of sub-requests from vllm_active_requests. This caused validation failures where expected sub-requests couldn't be found. The fix ensures sub-requests are removed from vllm_active_requests at the correct time - after determining they won't go to tools but before calling _finalize_sub_request to avoid deadlock in _maybe_process_and_insert. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Removes debugging code from `backfill-prompts` (#1008) * remove flash infer from dependencies, mason, and benchmark script (#1000) * update eval names (#1001) * Puzzle verifier (#1003) * adding PuzzleMatcherVerifier * Update ground_truth_utils.py * Create test_ground_truth_utils.py * Update ground_truth_utils.py * Update ground_truth_utils.py * lint * lint, address gemini * Puzzle -> puzzle * Fixed tests. * Cleaned up test * Cleaned up test --------- Co-authored-by: Teng Xiao <[email protected]> Co-authored-by: Finbarr Timbers <[email protected]> * fixes for manual eval jobs with new cluster names (#1002) * fixes for manual eval jobs with new cluster names * fix default cluster names * Disables cascade attention. (#1006) * Disables cascade attention. * fix issue with creation of LLMEngine. * whoops wrong name (#1007) * Clean up * Removed request_tracking. * Removed tracking key call * Fixed attribute error * Removed tracking key * Removed asserts * Fix KeyError in _poll_tool_futures by restoring missing concat_outputs assignment The removal of asserts accidentally removed the line that stores output_copy in tracking['concat_outputs'], causing a KeyError when _poll_tool_futures tries to access the tracking data for tool processing. 🤖 Generated with Claude Code Co-Authored-By: Claude <[email protected]> * Fix linter errors: remove unused variables - Removed unused 'prefix' variable in add_request function - Removed unused 'request_id' variable in _prefetch_worker - Removed unused 'training_step' variable in _maybe_process_and_insert - Removed unused 'context' variable in _poll_tool_futures 🤖 Generated with Claude Code Co-Authored-By: Claude <[email protected]> * Undid local changes. * Fixed accidental deletion * Removed copy * Cleaned function signatures. * Cleaned up finalize_outputs * Added script. (#1009) * Changed vllm_active_requests into a set. * Lots of clean up. * Cleaned up _maybe_process_and_insert significantly. 
* Attempt at fixing empty response issue. * Fixed errors and testing with fake asserts. (#1010) * Small refactor to prepare for the `backfill-prompts` PR (#1011) * Changes have been made. * Updated function signature of _finalize_outputs. * Added tests for make_request_id. * Deleted old function. * Cleaned up test * Removed spurious except clause * Fixed bug where we were passing the wrong sampling params to _finalize_outputs. * Fixed tests. * Ran linter. * attempted to fix bug in completions. * Removed _finalize_outputs, renamed process_outputs. * Check we have enough data for bsz/prefill (#1012) * check that we have enough data * lint * comments * update comment * minor update * Cleaned up code significantly. * Cleaned up tests. * Combine process_output and _process_completed_request into process_completed_request function. * Removed duplicate definition of _extract_base_request_id. * Fixed linter error. --------- Co-authored-by: Saurabh Shah <[email protected]> Co-authored-by: Hamish Ivison <[email protected]> Co-authored-by: Teng Xiao <[email protected]> Co-authored-by: FaezeBr <[email protected]> Co-authored-by: Claude <[email protected]> * Fixed linter error. * Add inflight_updates argument to enable quick pausing/resumption - Added inflight_updates boolean argument to Args class in grpo_fast.py - Passed inflight_updates through to vllm_utils3.create_vllm_engines - Modified LLMRayActor to accept and handle inflight_updates mode - When inflight_updates is True, process_from_queue returns immediately when should_stop is set - When inflight_updates is False, maintains existing behavior (waits for pending work) This allows for quick pausing and resumption of request processing during model updates. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Refactor process_from_queue loop to use _should_exit method - Added _should_exit method to encapsulate all exit logic - Uses early returns with no nested conditions for clarity - Optimizes by checking expensive operations only when needed - Removed duplicate exit condition checks from the main loop - Simplified while loop to use while not self._should_exit() This makes the code more maintainable and the exit conditions clearer. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Removed comment. * Longer benchmark * undid changes to benchmark * Removed _has_Pending_engine_requests_for-base_id method. * Used a dictionary instead of a loop. * Added simulated weight sync to code * Now, we set inflight updates to be true * Updated timing * Removed inflight updates * Made more reproducible * removed inflight updates * Ran linter, merged changes, and set temperature back to 1.0 * Fixed sampling params differences * Set inflight updates true --------- Co-authored-by: Claude <[email protected]> Co-authored-by: Saurabh Shah <[email protected]> Co-authored-by: Hamish Ivison <[email protected]> Co-authored-by: Teng Xiao <[email protected]> Co-authored-by: FaezeBr <[email protected]>
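The `_should_exit` refactor and the `inflight_updates` flag described above lend themselves to a small illustration. A sketch of the shape such a method could take; apart from `inflight_updates`, `_should_exit`, `active_tasks`, and `request_outputs`, the attribute and method names are assumptions:

```python
class EngineLoopSketch:
    """Illustrative skeleton only, not the project's LLMRayActor."""

    def __init__(self, inflight_updates: bool = False) -> None:
        self.inflight_updates = inflight_updates
        self.active_tasks: dict = {}      # generation tasks still running
        self.request_outputs: dict = {}   # finished outputs awaiting flush

    def _stop_requested(self) -> bool:
        # In the real actor this asks the actor manager whether a weight
        # sync is pending; stubbed out here.
        return False

    def _should_exit(self) -> bool:
        # Early returns, cheapest checks first.
        stop = self._stop_requested()
        if self.inflight_updates and stop:
            return True          # pause immediately so weights can sync
        if self.active_tasks:
            return False         # still generating
        if self.request_outputs:
            return False         # results still need to be handed back
        return stop              # idle: exit only if a stop was requested
```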
github-merge-queue bot pushed a commit that referenced this pull request on Oct 17, 2025
* Now, processing persists across calls to process_from_queue * Added local tool script * Updated code * Lots of changes * Fixed exit condition * Cleaned up code * Added a fix to make sure we don't process a request until all tool futures are complete. * Fixed bug where we were accidentally adding unfinished requests to request_outputs. * Updated script so we can't launch experiments with a dirty repo. * fixed bug * mdofied script * Attempt at making the flow correct * Added additional tracking. * Added more logging * Fixed bug that was incorrectly considering a request complete. * Checks for active vllm continuations * Fixed sub-requests * Added additional assertions to track bug down. * Updated code to add validation * Added more validation. * Fixed error in validation * Added more validation. * Fixed error in validation. * Added more validation. * Fixed validations. * updated validation. * Fixed a bug with processing. Added a function to handle finishing a request. * Moved where we call validation to be in a more appropriate place. * Fixed validation to be more comprehensive. * Fixed bug where we were passing a RequestOutput object rather than a list[CompletionOutput]. * Fixed else clause. * Fixed bug where we were modifying the tracking dict while iterating over it. * Fixed validation * Removed redundant safety check. * Added more logging. * Attempt at fixing issue * Moves delete * Fixed bug in tool code. * Ran linter. Tool script seems to run fine. * Fixed bug in non-tool use. * Updated code to fix index bug. * Undid host network change to tool_grpo_fast.sh. * Fixed no host networking command from tool_grpo_fast.sh. * Fixed incorrect index field. * Fixed index * Fix request tracking and training_step propagation for non-tool use - Use dataclasses.replace to properly set CompletionOutput.index when tools disabled - Make request_tracking keys unique by combining training_step and dataset_index - Pass training_step through to GenerationResult to enable proper tracking key construction This fixes the hang that occurred when tools were disabled due to: 1. Incorrect index values preventing output matching 2. Duplicate dataset indices across training steps causing prefetch thread crash * Fix KeyError by accessing training_step before cleanup The recent commit changed the tracking system to use combined keys (training_step_dataset_index), but there was a bug where we tried to access request_metadata[request_id]["training_step"] after calling _cleanup_request_data, which removes the metadata. This fix gets the training_step value before calling _cleanup_request_data. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Add assertion to verify all n sub-requests are tracked This assertion will help debug the hanging issue in single_gpu_on_beaker.sh by ensuring that all n sub-requests are properly added to vllm_active_requests. If the assertion fails, it means some sub-requests are not being tracked properly after being added to vLLM. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Fix index field timing issue in non-tool mode with n>1 Move index field assignment before adding CompletionOutput to request_outputs to ensure _maybe_process_and_insert can properly identify sub-requests. 
🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Fix CompletionOutput index for non-tool mode with n>1 When tools are disabled and n>1, fix the index field immediately when receiving outputs from vLLM. Each sub-request gets the correct index extracted from its ID (e.g., train_3_12_1 -> index=1). Added assertions to verify the fix is working correctly. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Changed loop * Set verbose to true * Fixed tracking key bug and added a health check for prefetch worker to kill LLMRayActor. * changed removal from vllm_active_requests to occur after _finalize * Fix sub-request tracking race condition in non-tools mode The issue occurred when tools were disabled due to premature removal of sub-requests from vllm_active_requests. This caused validation failures where expected sub-requests couldn't be found. The fix ensures sub-requests are removed from vllm_active_requests at the correct time - after determining they won't go to tools but before calling _finalize_sub_request to avoid deadlock in _maybe_process_and_insert. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Removes debugging code from `backfill-prompts` (#1008) * remove flash infer from dependencies, mason, and benchmark script (#1000) * update eval names (#1001) * Puzzle verifier (#1003) * adding PuzzleMatcherVerifier * Update ground_truth_utils.py * Create test_ground_truth_utils.py * Update ground_truth_utils.py * Update ground_truth_utils.py * lint * lint, address gemini * Puzzle -> puzzle * Fixed tests. * Cleaned up test * Cleaned up test --------- Co-authored-by: Teng Xiao <[email protected]> Co-authored-by: Finbarr Timbers <[email protected]> * fixes for manual eval jobs with new cluster names (#1002) * fixes for manual eval jobs with new cluster names * fix default cluster names * Disables cascade attention. (#1006) * Disables cascade attention. * fix issue with creation of LLMEngine. * whoops wrong name (#1007) * Clean up * Removed request_tracking. * Removed tracking key call * Fixed attribute error * Removed tracking key * Removed asserts * Fix KeyError in _poll_tool_futures by restoring missing concat_outputs assignment The removal of asserts accidentally removed the line that stores output_copy in tracking['concat_outputs'], causing a KeyError when _poll_tool_futures tries to access the tracking data for tool processing. 🤖 Generated with Claude Code Co-Authored-By: Claude <[email protected]> * Fix linter errors: remove unused variables - Removed unused 'prefix' variable in add_request function - Removed unused 'request_id' variable in _prefetch_worker - Removed unused 'training_step' variable in _maybe_process_and_insert - Removed unused 'context' variable in _poll_tool_futures 🤖 Generated with Claude Code Co-Authored-By: Claude <[email protected]> * Undid local changes. * Fixed accidental deletion * Removed copy * Cleaned function signatures. * Cleaned up finalize_outputs * Added script. (#1009) * Changed vllm_active_requests into a set. * Lots of clean up. * Cleaned up _maybe_process_and_insert significantly. * Attempt at fixing empty response issue. * Fixed errors and testing with fake asserts. (#1010) * Small refactor to prepare for the `backfill-prompts` PR (#1011) * Changes have been made. * Updated function signature of _finalize_outputs. * Added tests for make_request_id. * Deleted old function. 
* Cleaned up test * Removed spurious except clause * Fixed bug where we were passing the wrong sampling params to _finalize_outputs. * Fixed tests. * Ran linter. * attempted to fix bug in completions. * Removed _finalize_outputs, renamed process_outputs. * Check we have enough data for bsz/prefill (#1012) * check that we have enough data * lint * comments * update comment * minor update * Cleaned up code significantly. * Cleaned up tests. * Combine process_output and _process_completed_request into process_completed_request function. * Removed duplicate definition of _extract_base_request_id. * Fixed linter error. --------- Co-authored-by: Saurabh Shah <[email protected]> Co-authored-by: Hamish Ivison <[email protected]> Co-authored-by: Teng Xiao <[email protected]> Co-authored-by: FaezeBr <[email protected]> Co-authored-by: Claude <[email protected]> * Fixed linter error. * Add inflight_updates argument to enable quick pausing/resumption - Added inflight_updates boolean argument to Args class in grpo_fast.py - Passed inflight_updates through to vllm_utils3.create_vllm_engines - Modified LLMRayActor to accept and handle inflight_updates mode - When inflight_updates is True, process_from_queue returns immediately when should_stop is set - When inflight_updates is False, maintains existing behavior (waits for pending work) This allows for quick pausing and resumption of request processing during model updates. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Refactor process_from_queue loop to use _should_exit method - Added _should_exit method to encapsulate all exit logic - Uses early returns with no nested conditions for clarity - Optimizes by checking expensive operations only when needed - Removed duplicate exit condition checks from the main loop - Simplified while loop to use while not self._should_exit() This makes the code more maintainable and the exit conditions clearer. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Removed comment. * Longer benchmark * undid changes to benchmark * Now, uses the async engine. * Fixed errors * Removed host networking from single * much simpler flow * Removed tracking variable. * Cleans up lifecycle. * Updated generate_one_completion. * Cleaned up main loop. * Refactored _process_request significantly * Simplified process_from_queue * Updates code * Updated clusters * updated script * updated script * updated script * Updated script * Updated script to match launch_benchmark.sh * Fixed bug. * updated priority * Fixed kv_cache_specs * Fixed kv_cache_specs * Added logging * Fixed methods * Ran linter. * Fix blocking ray.get in async actor Convert _should_stop to async method to avoid blocking the event loop. This fixes the issue where "Engine is gracefully shutting down" message repeats indefinitely due to blocked async execution. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Improve async _should_stop to prevent blocking and duplicate requests - Track in-flight requests with _inflight_ref to prevent duplicates - Use asyncio.wait_for instead of ray.wait for proper async handling - Cancel Ray tasks on timeout to avoid resource leaks - Prevents blocking the event loop and improves async performance 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Added timeouts for all the scripts * Now, we always run the engine loop. 
* Fixed engine initialization. * added await to generator. * Changed loggign for vllm * Fixed logging levels * Added more timing * Fixed deadlock * set logging to debug for vllm * set logging to debug for vllm_utils3.py * Fixed timeout bug * an attempted fix * Add async engine implementation for vLLM - Replace synchronous LLMEngine with AsyncLLMEngine - Add async request processing with asyncio tasks - Implement non-blocking weight updates with inflight_updates flag - Add proper async handling for tool calls - Update debug scripts with timeouts * Attempt at fixing * Add detailed logging to track vLLM generation hang * fixed error * Fixed wait stalling * fix issues * Add comprehensive logging to debug queue hang issue - Added detailed logging to track queue operations in vllm_utils3.py - Added logging for data flow in grpo_fast.py - Track when results are put/get from Ray queues - Log queue objects to identify routing issues - Add iteration counters and request details This will help identify where the data flow is getting stuck between the vLLM generation and data preparation threads. * Add detailed logging to trace request flow through queues - Log every step from queue insertion to engine processing - Track queue object references to identify routing issues - Log prefetch task startup and iterations - Add logging to _add_request and _process_request methods - Log when requests are successfully retrieved from queues - Track active tasks and batch sizes This will help identify where requests are getting stuck between submission and processing. * Fix process_from_queue hanging issue The bug was that the prefetch task was interfering with asyncio.wait. When asyncio.wait included the prefetch task with return_when=FIRST_COMPLETED, it would return immediately when the prefetch task completed (which happens quickly when the batch is full), without waiting for actual generation tasks. This caused process_from_queue to return 0 almost immediately thousands of times, even though there were active generation tasks running. The fix: - Only wait for actual generation tasks in self.active_tasks - Don't include the prefetch task in the wait list - Remove the special handling for prefetch task in the done set - Properly wait for the timeout period when there are active tasks This ensures process_from_queue actually waits for generation to complete instead of returning immediately. * Add assertion to detect race condition in request ID reuse * Fix premature exit in vllm_utils3 when using tools The _should_exit method was incorrectly exiting when active_tasks briefly went to 0 during tool processing. This caused the vLLM actor to exit early, leaving requests unprocessed and causing the main training loop to hang. Changes: - Check for incomplete requests in request_outputs before deciding to exit - Add a grace period when no tasks are active to allow for gaps in tool processing - Prevent premature exit during multi-step tool completions This fixes the hang issue when running tool_grpo_fast.sh while maintaining compatibility with non-tool scripts like single_gpu_on_beaker.sh. 
🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Removed metadata from logs * Add detailed logging to _process_request to debug tool use hang Added comprehensive logging to track the exact flow through _process_request: - Log entry/exit points with request IDs - Track loop iterations and await statements - Log tool triggering and execution - Track when requests cannot continue - Log lock acquisition and output collection This will help identify where the hang occurs during tool processing. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Add detailed logging around tokenizer access to debug hang Added logging to identify exactly where the hang occurs when processing tool outputs. Logs track: - Before/after accessing self.llm_engine.engine.tokenizer - Before/after encoding tool output text - The actual text being encoded and number of tokens This will help determine if the hang is during tokenizer access or encoding. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Updated endpoint * Add detailed logging between lines 653-682 to identify exact hang location - Add logging after creating prompt_and_tool_output_token - Add logging after calculating excess - Add logging for can_continue state changes - Add logging after extending token_ids and masks - Add logging after calculating new_sample_tokens This will help pinpoint exactly which line causes the hang when processing empty tool outputs. * Add detailed logging to debug token_ids access hang * Fix tuple/list concatenation issue in vllm_utils3.py The issue was that request_output.outputs[0].token_ids is returned as a tuple from VLLM, but the code was trying to: 1. Extend it with .extend() (which only works on mutable lists) 2. Concatenate it directly with lists using + Fixed by converting token_ids to a list when first storing it in request_output. * Fix tuple concatenation issue in tool execution path Convert output.prompt_token_ids from tuple to list before concatenation to prevent hanging when processing tool outputs. VLLM returns tuples which cannot be concatenated with lists using the + operator. * Add detailed debugging to track types during token concatenation Added logging to understand what types are being concatenated and where the hang occurs in the tool execution path. * Fix attribute access for max_model_len in tool execution path The AsyncLLMEngine wrapper doesn't have model_config directly. Need to access it through self.llm_engine.engine.model_config instead. This was causing a hang when trying to access the attribute. * updated to use timeouterror * fixed syntax error * More logging * Updated code * Fixed loop * Fixed logging * Attempted to mirror the synchronous loop. * hold the lock less. * less frequent logging * Updated behaviour to match * Fixed max exceeded calls * Updated tool path behaviour * Fix request ID collision in async engine for tool continuations Use unique iteration-specific request IDs (e.g. train_4_20273_3_iter2) to avoid vLLM engine conflicts when handling multiple iterations with tools. The base request_id is still used as cache_salt for optimal prefix caching. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Removed the lock * CLeaned up PR. * Update async engine (#1043) * Longer sleep * Updated mason.py to set default env vars centrally, and set a much longer vllm iteration timeout. 
* Removed vllm_enable_V1_multiprocessing. * Set debug * Set vllm_enforce_eager to large_test_script. * Set inflight updates to False * Added test script * Fix vLLM async engine hanging by adding polling with timeouts The vLLM AsyncLLMEngine was hanging because its background task wasn't getting CPU time to process requests. The async generators were stuck waiting indefinitely for outputs that never came. This fix: - Adds an explicit yield after getting the generator to let the background task start - Replaces the blocking async for loop with a polling approach using asyncio.wait_for - Yields control to the event loop on timeouts to allow background tasks to run - Adds a maximum wait time to prevent indefinite hangs This ensures the vLLM engine's background task gets regular opportunities to process requests and yield outputs. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Fix vLLM async generator corruption from timeout The previous fix using asyncio.wait_for on generator.__anext__() was corrupting the generator state. After a timeout, the generator would immediately raise StopAsyncIteration, causing requests to fail after only 0.1 seconds. This fix uses a task-based approach: - Create a background task to consume the generator naturally - Poll the task for completion while yielding control - Never timeout on the generator itself to avoid corruption - Cancel the task cleanly if overall timeout is reached This ensures the generator runs normally while still yielding control to the vLLM engine's background processing. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Run AsyncLLMEngine in separate thread to fix event loop issues The vLLM AsyncLLMEngine wasn't getting proper event loop time in the Ray actor context, causing generators to hang. This fix: 1. Runs the AsyncLLMEngine in a dedicated thread with its own event loop 2. Converts Ray-called methods to synchronous with thread-safe bridges 3. Uses asyncio.run_coroutine_threadsafe to submit work to engine thread Key changes: - Added _run_engine_loop() to create dedicated thread with event loop - Made process_from_queue() synchronous, delegating to _async version - Created sync wrappers for init_process_group, update_weight, ready, etc. - Engine thread runs continuously, ensuring background tasks get CPU time This ensures the vLLM engine's background task runs properly without event loop starvation, fixing the hanging issue after step 1. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Revert "Run AsyncLLMEngine in separate thread to fix event loop issues" This reverts commit 127c361. * Simplify generate_one_completion by removing polling wrapper The polling wrapper with asyncio.create_task was adding unnecessary complexity. Going back to the simple async for loop to let vLLM's generator work naturally. 
🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <[email protected]> * Isolate async code to prefetch threads using queue-based approach - Added thread-safe completion queue for processed results - Modified _process_request to check completion and push to queue - Converted process_from_queue to synchronous method - Removed obsolete _check_and_process_completed_requests method - Run async components in dedicated thread with own event loop - Use standard queue operations for synchronous result pulling This approach avoids event loop starvation issues by ensuring vLLM's AsyncLLMEngine runs in its own dedicated thread. 🤖 Generated with Claude Code Co-Authored-By: Claude <[email protected]> * Fix AsyncLLMEngine initialization in dedicated thread - Create AsyncLLMEngine inside the thread's event loop in __init__ - Remove lazy initialization methods (_ensure_engine_initialized, etc) - Ensure AsyncLLMEngine is created in the same event loop where it's used - Use threading.Event to wait for initialization completion - Keep thread-safe queue for result communication This fixes the hanging issue by ensuring vLLM's AsyncLLMEngine is created and used in the same event loop context. 🤖 Generated with Claude Code Co-Authored-By: Claude <[email protected]> * Fix AsyncLLMEngine initialization timeout for tool scripts - Increase initialization timeout from 30s to 120s for tool scripts - Add detailed logging for each initialization step - Move prefetch task creation after init complete signal - Add exc_info=True to error logging for better debugging This fixes the timeout issue that was occurring with tool scripts which need more time to initialize due to additional resources. 🤖 Generated with Claude Code Co-Authored-By: Claude <[email protected]> * Fix AsyncLLMEngine double event loop conflict The hanging issue was caused by creating two separate event loops: - Our thread created its own event loop with run_forever() - AsyncLLMEngine with start_engine_loop=True created another background thread/loop This caused a conflict where async operations couldn't execute properly. Fix: Set start_engine_loop=False so AsyncLLMEngine uses our managed event loop instead of creating its own. This eliminates the conflict and allows proper async execution. 🤖 Generated with Claude Code Co-Authored-By: Claude <[email protected]> * Fix AsyncLLMEngine creation to happen inside async context The issue was that AsyncLLMEngine with start_engine_loop=False needs to be created inside an async context, not synchronously. Fix: - Create an async setup() function inside the thread - Use run_until_complete(setup()) to create engine in async context - Then run_forever() to keep the loop running This ensures AsyncLLMEngine is properly initialized in the correct async context before we start using it. 🤖 Generated with Claude Code Co-Authored-By: Claude <[email protected]> * No more async def methods. * Fixed loop. * Cleaned up process_from_queue * Fixed engine init * Fixed background loop init. * Fixed bug * Fixes to code * set infligth updates to false * Add assert_threaded_actor to check Ray actor type * Fix: Use threaded Ray actors instead of async actors The issue was that Ray actors default to async actors with uvloop event loop, which conflicts with our custom event loop setup in AsyncLLMEngine. Single GPU mode worked because it had different timing, but multi-node consistently failed. 
Solution: Use max_concurrency=1000 option to create threaded actors that don't have event loops, avoiding the conflict entirely. * Remove async method to force threaded actor Ray creates an AsyncActor if ANY async def method exists in the class. Removed unused _q_get_async method to ensure LLMRayActor is a threaded actor, avoiding event loop conflicts with AsyncLLMEngine. * Add debug logging to identify async methods * Add more details to event loop assertion error * Update code * Fix event loop detection - use get_running_loop instead of get_event_loop The issue was that asyncio.get_event_loop() in Python 3.10+ creates a loop if none exists, so it always succeeded. We need to check for a RUNNING loop with get_running_loop() which only succeeds in async actors. * Updated timeout * Set NCCL_DEBUG to warn. * Less logs. * more logging * Added memray dep * Moved logging * fixed logging * Updated logging. * Fix CUDA OOM: Initialize AsyncLLMEngine in main thread The OOM error was caused by AsyncLLMEngine being initialized in a separate thread, creating CUDA context isolation. When update_weight() tried to allocate memory from the main thread, it couldn't access the memory pool from the async thread. Fix: Initialize AsyncLLMEngine in the main thread to ensure all CUDA operations happen in the same context. Only the event loop runs in a separate thread now. * Fix CUDA OOM: Run collective_rpc operations in event loop thread The OOM error was caused by CUDA context isolation between threads. When AsyncLLMEngine is created in a separate thread, all CUDA operations including model weights and KV cache allocation happen in that thread's CUDA context. When update_weight() was called from the main thread (via Ray RPC), it tried to allocate memory in a different CUDA context, causing apparent OOM even though memory was available in the other thread's context. Fix: Use asyncio.run_coroutine_threadsafe() to execute all collective_rpc operations (init_process_group, update_weight, update_weight_cuda_ipc) in the event loop thread where the engine and CUDA context were created. This ensures all CUDA operations happen in the same thread/context while maintaining the async functionality of AsyncLLMEngine. * now we use collective_rpc * Undid changes * Fix vLLM hanging after weight updates by running collective_rpc in thread pool The issue was that collective_rpc operations (which are blocking) were being run in the event loop thread, blocking the AsyncLLMEngine's background loop from processing queued requests. Changes: - Run collective_rpc via loop.run_in_executor() to avoid blocking the event loop - Apply fix to init_process_group, update_weight, and update_weight_cuda_ipc - Use functools.partial to properly pass keyword arguments to collective_rpc This allows the async engine to continue processing requests during weight updates. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Fix CUDA context: Use call_soon_threadsafe instead of run_in_executor Previous fix (b271bd0) broke CUDA context by running collective_rpc in a ThreadPoolExecutor worker thread instead of the event loop thread. Root cause: collective_rpc MUST run on the same thread as AsyncLLMEngine (the event loop thread) to maintain CUDA context. Using run_in_executor moved it to a different thread, breaking CUDA operations and causing vLLM engines to hang after ~46 completions. 
Solution: Use loop.call_soon_threadsafe() to schedule collective_rpc as a synchronous callback on the event loop thread. This: - Maintains correct CUDA context (runs in event loop thread) - Blocks the event loop during weight updates (necessary for CUDA) - Properly handles exceptions and return values via threading.Event Changes: - Replace run_coroutine_threadsafe + run_in_executor pattern - Use call_soon_threadsafe with threading.Event for synchronization - Apply to init_process_group, update_weight, and update_weight_cuda_ipc 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Move threading import to top of file 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Add diagnostic logging for AsyncLLMEngine background loop Add comprehensive logging to diagnose why the AsyncLLMEngine background loop stops working after weight updates: 1. update_weight(): Log when collective_rpc starts/ends, log background loop task state before/after the blocking call 2. generate_one_completion(): Log every 10 seconds if stuck waiting for outputs, log error if generator exits without producing output 3. _prefetch_requests(): Periodically check and log background loop task health (every 5 seconds), detect if task is done/cancelled/failed This will help identify: - Is the background loop task still alive after blocking? - Does it become done/cancelled during the collective_rpc? - Are generators stuck waiting or exiting early? - What's the exact state transition during weight sync? 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Fix attribute name for background loop task diagnostic Fix typo: use _background_loop_unshielded instead of _background_loop_unshielded_task to match vLLM v0.9.1's actual attribute name. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Revert to simple collective_rpc implementation Revert all async/threading complexity for collective_rpc operations. The root cause was that blocking the event loop thread with call_soon_threadsafe creates orphaned async tasks - the AsyncLLMEngine background loop gets stuck at await points and never recovers. Changes: - Remove call_soon_threadsafe + threading.Event approach - Remove diagnostic logging - Revert to simple: directly call collective_rpc from Ray RPC handler thread - This was the original implementation before commit ed66c24 This may cause CUDA OOM if context isolation is still an issue, but we'll test if that actually occurs consistently or if it was a flaky/different problem. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Set max concurrency = 1 * longer timeout * Add AsyncLLMEngine background loop health check Check if the AsyncLLMEngine background loop is running at the start of process_from_queue() and restart it if needed. Also assert that the background loop task is not done/cancelled to catch issues early. This fixes a potential issue where the background loop stops after weight sync, causing inference to hang indefinitely. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Added check that tasks aren't dying. * Update max concurrency * Added nccl debug * added codex todo * Fixed error --------- Co-authored-by: Claude <[email protected]> * Removed debugging code * Ran linter * Fixed issue. 
* Minimized differences between new code and old.
* Fixed cluster warning in large_test_script.sh.
* Cleaned up PR.
* Updated assert threaded actor class.
* Fixed class.
* Set default values for large_test_script.sh
* set enforce eager
* now, we set inflight updates false
* Now, we don't set enforce eager.
* Updated large_test_script.sh
* Fixed env var issue
* Now we set inflight updates true
* trying to start/stop background loop
* Removed start of loop
* now, we use sleep/wake_up to make things work.
* Set inflight true on single
* Removed sleep/wakeup code
* Updated code
* Fixed typo
* Ran linter
* Fixed bug
* switched to use the v1 engine.
* updated code
* Fixed issue where we were calling v0 APIs.
* Fixed hanging issue
* Updated code to remove pause_generation calls.
* updated code
* Fixed abort issue
* updated code
* Add diagnostic logging and fix vLLM v1 compatibility
  - Switch from add_request() to generate() to properly start output handler
  - Remove break statement to prevent spurious request aborts
  - Add exception handling and diagnostic logging to debug hanging issues
  - Fix collective_rpc calls to use async versions with run_coroutine_threadsafe
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Set vllm logs to debug
* Updated vllm version to 10.2.
* Updated flash attn version
* Ran uv sync
* Fix AsyncLLMEngine hanging by creating it within running event loop
  The AsyncLLMEngine was being created before the event loop started running, which prevented the output handler from starting automatically. This caused all generation requests to hang indefinitely.
  Fixed by creating the engine from within the running loop context using run_until_complete(), ensuring AsyncLLM.__init__ detects the running loop and starts the output handler.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Move _init_engine_async to module-level function
  The async method was triggering the threaded actor assertion that prevents async methods on the Ray actor class. Moving it to a free function resolves this while maintaining the same functionality.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Add comprehensive diagnostic logging for async task tracing
  - Add logging to verify AsyncLLMEngine output handler initialization
  - Track async task lifecycle (entry/exit) in process_request_async()
  - Monitor async generator iteration in generate_one_completion()
  - Add periodic event loop health monitor (every 30s)
  - Log pending tasks to identify hung requests
  This will help trace why some requests (train_1_44, train_1_3, etc.) hang while others complete successfully.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Add diagnostic logging to trace process_from_queue exit behavior
  - Log _should_exit() decision-making with stop status and pending work counts
  - Track process_from_queue() while loop entry/exit and iteration count
  - Log completion_queue operations (get success, empty exceptions)
  - This will reveal why the loop exits early leaving items in the queue
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Add diagnostic logging for weight sync stop_requested toggle
  Added INFO level logging to trace the weight sync stop/resume cycle:
  - Log before/after setting should_stop to True
  - Log entry to try block and completion of weight broadcast
  - Log entry to finally block before setting should_stop to False
  - Log in ActorManager.set_should_stop to confirm state changes
  This will help diagnose why stop_requested stays True forever instead of toggling back to False after weight sync completes.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Add diagnostic logging to trace weight broadcast deadlock
  Added INFO level logging to:
  1. update_weight() in vllm_utils3.py:
     - Log entry with param name/dtype/shape
     - Log async task scheduling
     - Log waiting for result
     - Log result completion
  2. broadcast_to_vllm() in grpo_fast.py:
     - Log entry with rank
     - Log total parameters to broadcast
     - Log each parameter being processed (name, shape, count)
     - Log update_weight.remote() future creation
     - Log before/after torch.distributed.broadcast
     - Log exit with future count
  This will reveal exactly where the deadlock occurs:
  - If update_weight() logs don't appear, Ray actor isn't receiving calls
  - If "Async task scheduled" appears but no result, event loop isn't processing
  - If broadcast doesn't complete, torch.distributed collective is deadlocked
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Add event loop diagnostic logging to update_weight
  Added logging to verify which event loops are being used:
  - self.loop reference
  - self.loop.is_running() status
  - asyncio.get_event_loop() reference
  - Future object details
  This will help identify if there's an event loop mismatch causing the asyncio.run_coroutine_threadsafe() call to fail to schedule the task.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Fix event loop mismatch in async RPC calls
  The deadlock was caused by an event loop mismatch:
  - self.loop (running in dedicated thread) vs
  - asyncio.get_event_loop() in calling thread (not running)
  When collective_rpc_async() was scheduled via run_coroutine_threadsafe, it ran on the correct loop but internally used the wrong loop context.
  Fix: Call asyncio.set_event_loop(self.loop) before run_coroutine_threadsafe in all methods that use it:
  - update_weight()
  - update_weight_cuda_ipc()
  - init_process_group()
  - reset_prefix_cache()
  This ensures asyncio.get_event_loop() inside the coroutines returns the correct running loop.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Add assertions to verify event loop consistency
  Added assertions and logging to catch event loop mismatches:
  1. In _init_engine_async():
     - Assert asyncio.get_running_loop() == actor.loop
     - Log both loops to verify engine is created on correct loop
  2. In update_weight():
     - Log self.loop and its running status
     - After set_event_loop, assert get_event_loop() == self.loop
     - Catch any loop mismatch before async call
  These assertions will fail fast if there's a loop inconsistency, helping identify exactly where the problem occurs.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Fix async RPC deadlock by using sync-to-async bridge
  The deadlock occurred because collective_rpc_async() internally calls asyncio.get_running_loop().create_future(), which requires execution within the event loop thread. Using run_coroutine_threadsafe() from external threads doesn't provide the right execution context.
  Solution: Add sync-to-async bridge pattern:
  - Add _sync_request_queue for cross-thread communication
  - Add _process_sync_requests() async task in event loop
  - Add _call_async_from_sync() helper for sync methods
  - Update init_process_group(), update_weight(), update_weight_cuda_ipc(), and reset_prefix_cache() to use the bridge
  This matches PipelineRL's architecture where async work happens in the event loop thread context, but keeps external API synchronous.
  🤖 Generated with [Claude Code](https://claude.com/claude-code)
  Co-Authored-By: Claude <[email protected]>
* Tried more fixes
* Updated to remove generate thread
* Updated code to add processing
* Change architecture
* Now we set vllm_insecure
* Set inflight false
* removed message serialization
* removed some logs
* Another attempt to fix hang
* lots of logging changes
* Ran linter.
* Reset scripts.
* Undid changes to mason.py
* Cleaned up PR.
* Cleaned up PR.
* Cleaned up PR.
* Cleaned up PR.
* Removed TimeoutError
* Cleaned up PR
* Uses async for
* Now, we handle tools.
* Cleaned assert code.
* Attempt at fixing code.
* Cleaned up assert
* Another attempt at fixing the bug
* Updated code
* Fix tool execution hanging by using dedicated executor instead of asyncio.to_thread (see the sketch below)
  The issue was that asyncio.to_thread uses a default thread pool with limited workers (min(32, cpu_count + 4)). With 32 concurrent requests (8 prompts x 4 samples), this could exhaust the thread pool and cause deadlock. Now using the dedicated ThreadPoolExecutor with 20 workers that's already created for tool execution, matching the behavior of the main branch.
* Fix async event loop issue - use get_running_loop() instead of get_event_loop()
  The issue was that process_request_async runs in a specific event loop via asyncio.run_coroutine_threadsafe, but we were calling get_event_loop() which could return a different loop or fail in a thread context. asyncio.to_thread internally uses get_running_loop(), so we must do the same when using run_in_executor to ensure we use the correct event loop.
* Add logging to check if executor is None during tool execution
* Add detailed logging to track tool execution and triggering
* Fix async event loop hanging by using unique request IDs for each iteration
  The issue was that after a tool call, we would loop back and try to generate again with the same sub_request_id. This caused vLLM to reject or hang on the duplicate request ID. Now we append '_iterN' to create unique IDs for each generation attempt within the same request.
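The last three fixes above interlock: a dedicated executor for blocking tool calls, resolving that executor against the running loop, and a fresh request ID for every generation attempt. A rough sketch of how those pieces fit together; `TOOL_EXECUTOR`, `tool.matches`, `tool.run`, and the naive prompt concatenation are placeholders for illustration, not the project's actual interfaces, while `engine.generate(prompt, sampling_params, request_id)` follows vLLM's async generator API as described in the commits.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Dedicated pool for blocking tool calls; the commit above replaces
# asyncio.to_thread (default pool of min(32, cpu_count + 4) workers) with an
# explicit executor so ~32 concurrent requests cannot exhaust it.
TOOL_EXECUTOR = ThreadPoolExecutor(max_workers=20)


async def process_request_async(engine, sub_request_id, prompt, sampling_params, tool=None):
    """Illustrative sketch of the tool-call generation loop."""
    iteration = 0
    while True:
        # Unique ID per generation attempt; reusing sub_request_id after a
        # tool call made vLLM reject or hang on the duplicate request ID.
        request_id = f"{sub_request_id}_iter{iteration}"
        final_output = None
        async for output in engine.generate(prompt, sampling_params, request_id):
            final_output = output
        text = final_output.outputs[0].text

        if tool is None or not tool.matches(text):
            return final_output

        # Run the blocking tool in the dedicated executor, resolved against the
        # *running* loop; get_event_loop() here could return the wrong loop.
        loop = asyncio.get_running_loop()
        tool_result = await loop.run_in_executor(TOOL_EXECUTOR, tool.run, text)
        prompt = prompt + text + tool_result
        iteration += 1
```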
* Add detailed logging to trace async generation hang
* Add detailed logging after tool execution to trace iteration hang
* Add fine-grained logging to debug model config access hang
* Add detailed logging to debug prompt concatenation hang
* Cache prompt_token_ids to avoid hang when accessing TokensPrompt property
* Fix undefined variable in assert_threaded_actor (see the sketch after this list)
* Updated code
* Set inflight false
* Fixed duplicate flag
* Simplified significantly
* Removed logs
* Simplified threading model
* Added handling for inflight_updates
* Inlined generate_one_completion
* Clean up
* More clean up
* Set inflight to true
* Cleaned up code.
* lots of cleanup
* Major refactor
* More PR cleanup
* Fixed code
* Updated script
* Cleaned up PR
---------
Co-authored-by: Claude <[email protected]>
Co-authored-by: Saurabh Shah <[email protected]>
Co-authored-by: Hamish Ivison <[email protected]>
Co-authored-by: Teng Xiao <[email protected]>
Co-authored-by: FaezeBr <[email protected]>
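Several commits in this log revolve around keeping LLMRayActor a plain threaded Ray actor and detecting a stray event loop early (the get_running_loop vs get_event_loop fix and the assert_threaded_actor cleanup). A minimal sketch of such a guard, assuming only the behaviour described in the commit messages rather than the repository's actual implementation:

```python
import asyncio


def assert_threaded_actor(instance) -> None:
    """Raise if this code is executing inside a running event loop.

    Ray turns a class into an async actor when any `async def` method is
    present; every method then runs on an event loop, which conflicts with
    AsyncLLMEngine owning its own loop. get_running_loop() is used instead of
    get_event_loop() because the latter may create a loop when none is
    running, so the check would always pass.
    """
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        return  # No running loop: we are in an ordinary threaded actor.
    raise RuntimeError(
        f"{type(instance).__name__} must be a threaded Ray actor, but it is "
        f"running inside event loop {loop!r}; remove async methods from the "
        "actor class so Ray does not create an AsyncActor."
    )
```

Per the commits, the original check relied on get_event_loop() and therefore never fired; switching to get_running_loop() makes it trigger only inside a genuinely async actor.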