
Conversation

finbarrtimbers
Collaborator

No description provided.

saurabh111233212 and others added 6 commits September 9, 2025 18:25
* adding PuzzleMatcherVerifier

* Update ground_truth_utils.py

* Create test_ground_truth_utils.py

* Update ground_truth_utils.py

* Update ground_truth_utils.py

* lint

* lint, address gemini

* Puzzle -> puzzle

* Fixed tests.

* Cleaned up test

* Cleaned up test

---------

Co-authored-by: Teng Xiao <[email protected]>
Co-authored-by: Finbarr Timbers <[email protected]>
* fixes for manual eval jobs with new cluster names

* fix default cluster names
* Disables cascade attention.

* fix issue with creation of LLMEngine.
@finbarrtimbers finbarrtimbers changed the base branch from main to backfill-prompts September 14, 2025 16:19
@finbarrtimbers finbarrtimbers changed the base branch from backfill-prompts to main September 14, 2025 16:19
finbarrtimbers and others added 15 commits September 14, 2025 21:23
* Fix KeyError in _poll_tool_futures by restoring missing concat_outputs assignment

The removal of asserts accidentally removed the line that stores output_copy
in tracking['concat_outputs'], causing a KeyError when _poll_tool_futures
tries to access the tracking data for tool processing.

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
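
A toy reconstruction of the restored bookkeeping; `tracking` mirrors the `tracking['concat_outputs']` shape implied by the commit message, and the request IDs are illustrative:

```python
import copy


def record_output(tracking: dict, sub_request_id: str, request_output: object) -> None:
    output_copy = copy.deepcopy(request_output)
    # This is the assignment whose accidental removal caused the KeyError:
    tracking["concat_outputs"][sub_request_id] = output_copy


if __name__ == "__main__":
    tracking = {"concat_outputs": {}}
    record_output(tracking, "train_3_12_0", {"token_ids": [1, 2, 3]})
    # _poll_tool_futures-style code can now find the stored copy again.
    assert "train_3_12_0" in tracking["concat_outputs"]
```
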
* Fix linter errors: remove unused variables

- Removed unused 'prefix' variable in add_request function
- Removed unused 'request_id' variable in _prefetch_worker
- Removed unused 'training_step' variable in _maybe_process_and_insert
- Removed unused 'context' variable in _poll_tool_futures

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
* Changes have been made.

* Updated function signature of _finalize_outputs.

* Added tests for make_request_id.

* Deleted old function.

* Cleaned up test

* Removed spurious except clause

* Fixed bug where we were passing the wrong sampling params to _finalize_outputs.

* Fixed tests.

* Ran linter.
@finbarrtimbers finbarrtimbers marked this pull request as ready for review September 15, 2025 17:11
@finbarrtimbers finbarrtimbers changed the base branch from main to backfill-prompts September 15, 2025 17:16
@finbarrtimbers finbarrtimbers merged commit fb9344b into backfill-prompts Sep 15, 2025
4 checks passed
@finbarrtimbers finbarrtimbers deleted the backfill-prompts-clean branch September 15, 2025 17:17
@finbarrtimbers finbarrtimbers restored the backfill-prompts-clean branch September 15, 2025 17:21
github-merge-queue bot pushed a commit that referenced this pull request Sep 16, 2025
…so that we are continually processing prompts. (#998)

* Passes single prompts through.

* Updated queue_types.py to match.

* Fixed issue with code.

* Fixed queue sizing issue.

* Updated length of tool use experiment.

* Merged conflicts

* Corrected inference batch size calculation.

* Fixed merge errors.

* Undid changes to test file.

* Undid changes.

* Cleaned up code

* Another attempt to fix the dataset_index bug.

* Another attempt to fix the dataset_index bug.

* Added assert statements.

* Removed debugging code.

* Fewer changes

* Undid changes

* Cleaned up PR.

* Fixed change in sort order

* Clean up PR

* Cleaned up code.

* Cleaned up PR.

* Fixed issue.

* Fixed linter errors.

* Updated tool_grpo_fast.sh to use new workspace.

* Removed redundant test.

* Added back whitespace.

* Ran linter.

* Refactored code.

* Cleaned up PR.

* Fixed linter error.

* Removed logging.

* Removed logging statement.

* Attempt at fix mask mismatch issue.

* Tests should pass now.

* Updated timing code.

* Ran linter.

* Added timing.

* Timing is fast now.

* Remove timing instrumentation code

- Remove detailed timing instrumentation from accumulate_inference_batches
- Remove enhanced Timer class duration property and initialization timing
- Remove process_from_queue timing breakdown logging
- Preserve all functional changes including single-prompt processing

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Added lots of debugging statements.

* Ran linter. Fixed bug.

* Added test file

* Removed whitespace

* Updated script.

* Cleaned up code.

* Removed debugging code.

* Fixed failing test.

* Set timeout for tests. They should take 5 minutes to run.

* Now, tests should pass.

* now tests should pass.

* Linter passes.

* Now, stream prompts into and out of the engine.

* now, tests should pass

* now, tests should actually pass

* now, tests should actually pass

* Cleaned up the way we do streaming.

* Ran linter.

* Undid changes to test.

* Refactored code.

* Fixed bug.

* Removed comment.

* Cleaned up PR.

* Removed second fill_engine call.

* Fixed linter error.

* updated code

* Attempt at fix from gpt5-pro.

* Updated timing code

* Updated timing code

* Tons of debugging

* Lots of changes, debugging, seems to run but slow?

* Remove debugging/logging code from actor_manager.py and grpo_fast.py

- Cleaned up excessive logging in actor_manager setup functions
- Removed debug timing code from load_data_from_packing_thread
- Restored clean function signatures and error handling

* Remove profiling and hanging operation debug instrumentation from vllm_utils3.py

- Removed ProfilerContext class and cProfile imports
- Removed warn_if_slow and warn_if_slow_hanging_operation decorators
- Removed HANGING_OPERATION_WARNING_TIMEOUT_S constant
- Cleaned up method calls wrapped with timing decorators

* Cleaned up PR

* Added back removed functionality.

* Cleaned up PR.

* Cleans up code and makes sure that we don't call fill_engine more than every 100 iterations

* Revert "Cleans up code and makes sure that we don't call fill_engine more than every 100 iterations"

This reverts commit 6d4d027.

* Adds a prefetch worker

* removed notify

* Added test, fixed prefetch bug

* Added verbose logging guard

* Cleaned up implementation

* Lots of help with prefetch thread

* Now, we refill on every step, and block on calls to insert data.

* Now, large_test_script.sh successfully completes.

* Updated prefetch thread to enqueue requests automatically.

* Added retries with backoff for queue operations

* Set reraise=True.

* Updated code to catch specific errors

* Fixed SyntaxError.

* now, tests pass

* Removed robust queue operations

* Updated code to clean it up

* Updated code

* Removed uv.lock

* Removed fill_engine.

* Tried to fix error in tool use

* A bunch of attempted fixes.

* Updated code to catch issue

* Attempt to fix resolution issue.

* made changes suggested by ChatGPT

* made chatgpt suggestions

* Claude suggested some changes to assertions.

* Another attempt at fixing by adding comprehensive assertions

* made a bunch of changes

* Now, processing persists across calls to process_from_queue

* Added local tool script

* Updated code

* Lots of changes

* Fixed exit condition

* Cleaned up code

* Added a fix to make sure we don't process a request until all tool futures are complete.

* Fixed bug where we were accidentally adding unfinished requests to request_outputs.

* Updated script so we can't launch experiments with a dirty repo.

* fixed bug

* modified script

* Attempt at making the flow correct

* Added additional tracking.

* Added more logging

* Fixed bug that was incorrectly considering a request complete.

* Checks for active vllm continuations

* Fixed sub-requests

* Added additional assertions to track bug down.

* Updated code to add validation

* Added more validation.

* Fixed error in validation

* Added more validation.

* Fixed error in validation.

* Added more validation.

* Fixed validations.

* updated validation.

* Fixed a bug with processing. Added a function to handle finishing a request.

* Moved where we call validation to be in a more appropriate place.

* Fixed validation to be more comprehensive.

* Fixed bug where we were passing a RequestOutput object rather than a list[CompletionOutput].

* Fixed else clause.

* Fixed bug where we were modifying the tracking dict while iterating over it.

* Fixed validation

* Removed redundant safety check.

* Added more logging.

* Attempt at fixing issue

* Moves delete

* Fixed bug in tool code.

* Ran linter. Tool script seems to run fine.

* Fixed bug in non-tool use.

* Updated code to fix index bug.

* Undid host network change to tool_grpo_fast.sh.

* Fixed no host networking command from tool_grpo_fast.sh.

* Fixed incorrect index field.

* Fixed index

* Fix request tracking and training_step propagation for non-tool use

- Use dataclasses.replace to properly set CompletionOutput.index when tools disabled
- Make request_tracking keys unique by combining training_step and dataset_index
- Pass training_step through to GenerationResult to enable proper tracking key construction

This fixes the hang that occurred when tools were disabled due to:
1. Incorrect index values preventing output matching
2. Duplicate dataset indices across training steps causing prefetch thread crash
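
A minimal sketch of the two fixes described above; `CompletionStub` stands in for vLLM's `CompletionOutput`, and the combined-key format is only inferred from the wording of the commit message:

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class CompletionStub:
    # Stand-in for vLLM's CompletionOutput, which may be immutable.
    index: int
    text: str


def tracking_key(training_step: int, dataset_index: int) -> str:
    # Combining both values keeps keys unique even when dataset indices
    # repeat across training steps.
    return f"{training_step}_{dataset_index}"


def with_index(output: CompletionStub, index: int) -> CompletionStub:
    # Build a copy with the corrected index instead of mutating in place.
    return dataclasses.replace(output, index=index)


if __name__ == "__main__":
    tracking = {tracking_key(3, 12): with_index(CompletionStub(index=0, text="..."), 1)}
    assert tracking["3_12"].index == 1
```
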

* Fix KeyError by accessing training_step before cleanup

The recent commit changed the tracking system to use combined keys
(training_step_dataset_index), but there was a bug where we tried to
access request_metadata[request_id]["training_step"] after calling
_cleanup_request_data, which removes the metadata.

This fix gets the training_step value before calling _cleanup_request_data.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Add assertion to verify all n sub-requests are tracked

This assertion will help debug the hanging issue in single_gpu_on_beaker.sh
by ensuring that all n sub-requests are properly added to vllm_active_requests.

If the assertion fails, it means some sub-requests are not being tracked
properly after being added to vLLM.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
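
The assertion could look roughly like the following; the sub-request ID layout (`train_<step>_<index>_<j>`) is an assumption taken from examples elsewhere in this log, and `vllm_active_requests` is modeled as a plain set:

```python
def assert_all_subrequests_tracked(base_id: str, n: int, vllm_active_requests: set[str]) -> None:
    # Every one of the n sub-requests should be present after add_request.
    expected = {f"{base_id}_{j}" for j in range(n)}
    missing = expected - vllm_active_requests
    assert not missing, f"sub-requests not tracked after add_request: {sorted(missing)}"


if __name__ == "__main__":
    active = {"train_3_12_0", "train_3_12_1", "train_3_12_2", "train_3_12_3"}
    assert_all_subrequests_tracked("train_3_12", n=4, vllm_active_requests=active)
```
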

* Fix index field timing issue in non-tool mode with n>1

Move index field assignment before adding CompletionOutput to request_outputs
to ensure _maybe_process_and_insert can properly identify sub-requests.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Fix CompletionOutput index for non-tool mode with n>1

When tools are disabled and n>1, fix the index field immediately when
receiving outputs from vLLM. Each sub-request gets the correct index
extracted from its ID (e.g., train_3_12_1 -> index=1).

Added assertions to verify the fix is working correctly.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
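
A small illustration of recovering the index from the ID format quoted above (`train_3_12_1 -> index=1`); the parsing helper is hypothetical rather than the repo's actual function:

```python
def index_from_request_id(request_id: str) -> int:
    # The sub-request index is the final underscore-separated field of the ID.
    return int(request_id.rsplit("_", 1)[1])


if __name__ == "__main__":
    assert index_from_request_id("train_3_12_1") == 1
    assert index_from_request_id("train_10_4567_0") == 0
```
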

* Changed loop

* Set verbose to true

* Fixed tracking key bug and added a health check for prefetch worker to kill LLMRayActor.

* changed removal from vllm_active_requests to occur after _finalize

* Fix sub-request tracking race condition in non-tools mode

The issue occurred when tools were disabled due to premature removal
of sub-requests from vllm_active_requests. This caused validation
failures where expected sub-requests couldn't be found.

The fix ensures sub-requests are removed from vllm_active_requests
at the correct time - after determining they won't go to tools but
before calling _finalize_sub_request to avoid deadlock in
_maybe_process_and_insert.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Removes debugging code from `backfill-prompts` (#1008)

* remove flash infer from dependencies, mason, and benchmark script (#1000)

* update eval names (#1001)

* Puzzle verifier (#1003)

* adding PuzzleMatcherVerifier

* Update ground_truth_utils.py

* Create test_ground_truth_utils.py

* Update ground_truth_utils.py

* Update ground_truth_utils.py

* lint

* lint, address gemini

* Puzzle -> puzzle

* Fixed tests.

* Cleaned up test

* Cleaned up test

---------

Co-authored-by: Teng Xiao <[email protected]>
Co-authored-by: Finbarr Timbers <[email protected]>

* fixes for manual eval jobs with new cluster names (#1002)

* fixes for manual eval jobs with new cluster names

* fix default cluster names

* Disables cascade attention. (#1006)

* Disables cascade attention.

* fix issue with creation of LLMEngine.

* whoops wrong name (#1007)

* Clean up

* Removed request_tracking.

* Removed tracking key call

* Fixed attribute error

* Removed tracking key

* Removed asserts

* Fix KeyError in _poll_tool_futures by restoring missing concat_outputs assignment

The removal of asserts accidentally removed the line that stores output_copy
in tracking['concat_outputs'], causing a KeyError when _poll_tool_futures
tries to access the tracking data for tool processing.

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>

* Fix linter errors: remove unused variables

- Removed unused 'prefix' variable in add_request function
- Removed unused 'request_id' variable in _prefetch_worker
- Removed unused 'training_step' variable in _maybe_process_and_insert
- Removed unused 'context' variable in _poll_tool_futures

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>

* Undid local changes.

* Fixed accidental deletion

* Removed copy

* Cleaned function signatures.

* Cleaned up finalize_outputs

* Added script. (#1009)

* Changed vllm_active_requests into a set.

* Lots of clean up.

* Cleaned up _maybe_process_and_insert significantly.

* Attempt at fixing empty response issue.

* Fixed errors and testing with fake asserts. (#1010)

* Small refactor to prepare for the `backfill-prompts` PR (#1011)

* Changes have been made.

* Updated function signature of _finalize_outputs.

* Added tests for make_request_id.

* Deleted old function.

* Cleaned up test

* Removed spurious except clause

* Fixed bug where we were passing the wrong sampling params to _finalize_outputs.

* Fixed tests.

* Ran linter.

* attempted to fix bug in completions.

* Removed _finalize_outputs, renamed process_outputs.

* Check we have enough data for bsz/prefill (#1012)

* check that we have enough data

* lint

* comments

* update comment

* minor update
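
The #1012 check above could be sketched roughly as follows; the function and argument names are illustrative rather than the ones used in `grpo_fast.py`:

```python
def check_enough_prompts(dataset_size: int, prompts_per_step: int) -> None:
    # Fail fast if the dataset cannot even fill one rollout batch.
    if dataset_size < prompts_per_step:
        raise ValueError(
            f"Dataset has {dataset_size} prompts but each step needs {prompts_per_step}; "
            "reduce the rollout batch size or provide more data."
        )


if __name__ == "__main__":
    check_enough_prompts(dataset_size=1024, prompts_per_step=256)
```
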

* Cleaned up code significantly.

* Cleaned up tests.

* Combine process_output and _process_completed_request into process_completed_request function.

* Removed duplicate definition of _extract_base_request_id.

* Fixed linter error.

---------

Co-authored-by: Saurabh Shah <[email protected]>
Co-authored-by: Hamish Ivison <[email protected]>
Co-authored-by: Teng Xiao <[email protected]>
Co-authored-by: FaezeBr <[email protected]>
Co-authored-by: Claude <[email protected]>

* Fixed linter error.

* Removed _has_pending_engine_requests_for_base_id method.

* Used a dictionary instead of a loop.

---------

Co-authored-by: Claude <[email protected]>
Co-authored-by: Saurabh Shah <[email protected]>
Co-authored-by: Hamish Ivison <[email protected]>
Co-authored-by: Teng Xiao <[email protected]>
Co-authored-by: FaezeBr <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Sep 17, 2025
* Add inflight_updates argument to enable quick pausing/resumption

- Added inflight_updates boolean argument to Args class in grpo_fast.py
- Passed inflight_updates through to vllm_utils3.create_vllm_engines
- Modified LLMRayActor to accept and handle inflight_updates mode
- When inflight_updates is True, process_from_queue returns immediately when should_stop is set
- When inflight_updates is False, maintains existing behavior (waits for pending work)

This allows for quick pausing and resumption of request processing during model updates.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
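
A rough sketch of the two behaviours described above, with a `threading.Event` standing in for the actor manager's stop signal and a plain queue standing in for the engine's pending work (names are illustrative, not the real objects in `vllm_utils3.py`):

```python
import queue
import threading


def process_from_queue(should_stop: threading.Event, pending: queue.Queue, inflight_updates: bool) -> int:
    processed = 0
    while True:
        if should_stop.is_set():
            if inflight_updates:
                return processed      # pause immediately; unfinished work resumes later
            if pending.empty():
                return processed      # old behaviour: only stop once pending work drains
        try:
            pending.get(timeout=0.1)
        except queue.Empty:
            continue
        processed += 1


if __name__ == "__main__":
    stop = threading.Event()
    stop.set()
    # With inflight_updates=True the loop returns as soon as the stop signal is seen.
    assert process_from_queue(stop, queue.Queue(), inflight_updates=True) == 0
```
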

* Refactor process_from_queue loop to use _should_exit method

- Added _should_exit method to encapsulate all exit logic
- Uses early returns with no nested conditions for clarity
- Optimizes by checking expensive operations only when needed
- Removed duplicate exit condition checks from the main loop
- Simplified while loop to use while not self._should_exit()

This makes the code more maintainable and the exit conditions clearer.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
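
The refactor described above might look roughly like this; the attributes are stand-ins rather than the real fields on `LLMRayActor`:

```python
class EngineLoop:
    def __init__(self, inflight_updates: bool) -> None:
        self.inflight_updates = inflight_updates
        self.stop_requested = False
        self.active_tasks: dict[str, object] = {}
        self.pending_outputs: dict[str, object] = {}

    def _should_exit(self) -> bool:
        # Cheapest checks first, each with an early return.
        if self.stop_requested and self.inflight_updates:
            return True                  # pause right away during a weight sync
        if self.active_tasks:
            return False                 # generation still running
        if self.pending_outputs:
            return False                 # completed outputs still waiting to be flushed
        return self.stop_requested       # idle: exit only if a stop was requested


if __name__ == "__main__":
    loop = EngineLoop(inflight_updates=True)
    loop.stop_requested = True
    assert loop._should_exit() is True
```
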

* Removed comment.

* Longer benchmark

* undid changes to benchmark

* Removed _has_pending_engine_requests_for_base_id method.

* Used a dictionary instead of a loop.

* Added simulated weight sync to code

* Now, we set inflight updates to be true

* Updated timing

* Removed inflight updates

* Made more reproducible

* removed inflight updates

* Ran linter, merged changes, and set temperature back to 1.0

* Fixed sampling params differences

* Set inflight updates true

---------

Co-authored-by: Claude <[email protected]>
Co-authored-by: Saurabh Shah <[email protected]>
Co-authored-by: Hamish Ivison <[email protected]>
Co-authored-by: Teng Xiao <[email protected]>
Co-authored-by: FaezeBr <[email protected]>
@finbarrtimbers finbarrtimbers deleted the backfill-prompts-clean branch September 18, 2025 17:51
github-merge-queue bot pushed a commit that referenced this pull request Oct 17, 2025
* Now, uses the async engine.

* Fixed errors

* Removed host networking from single

* much simpler flow

* Removed tracking variable.

* Cleans up lifecycle.

* Updated generate_one_completion.

* Cleaned up main loop.

* Refactored _process_request significantly

* Simplified process_from_queue

* Updates code

* Updated clusters

* updated script

* updated script

* updated script

* Updated script

* Updated script to match launch_benchmark.sh

* Fixed bug.

* updated priority

* Fixed kv_cache_specs

* Fixed kv_cache_specs

* Added logging

* Fixed methods

* Ran linter.

* Fix blocking ray.get in async actor

Convert _should_stop to async method to avoid blocking the event loop.
This fixes the issue where "Engine is gracefully shutting down" message
repeats indefinitely due to blocked async execution.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Improve async _should_stop to prevent blocking and duplicate requests

- Track in-flight requests with _inflight_ref to prevent duplicates
- Use asyncio.wait_for instead of ray.wait for proper async handling
- Cancel Ray tasks on timeout to avoid resource leaks
- Prevents blocking the event loop and improves async performance

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
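
A self-contained sketch of the pattern described above; the remote lookup is faked with `asyncio.sleep`, whereas the real code would wrap the Ray call to the actor manager:

```python
import asyncio


class StopState:
    def __init__(self) -> None:
        self._should_stop = False
        self._inflight = None            # at most one lookup in flight at a time

    async def _query_manager(self) -> bool:
        await asyncio.sleep(0.5)         # stand-in for the remote round trip
        return True

    async def should_stop(self, timeout: float = 0.1) -> bool:
        if self._inflight is None:
            self._inflight = asyncio.create_task(self._query_manager())
        try:
            self._should_stop = await asyncio.wait_for(self._inflight, timeout)
        except asyncio.TimeoutError:
            self._inflight.cancel()      # drop the slow lookup rather than leak it
        if self._inflight.done():
            self._inflight = None
        return self._should_stop


async def main() -> None:
    state = StopState()
    print(await state.should_stop(timeout=0.1))   # False: lookup timed out, cached value used
    print(await state.should_stop(timeout=1.0))   # True: a fresh lookup completes in time


if __name__ == "__main__":
    asyncio.run(main())
```
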

* Added timeouts for all the scripts

* Now, we always run the engine loop.

* Fixed engine initialization.

* added await to generator.

* Changed logging for vllm

* Fixed logging levels

* Added more timing

* Fixed deadlock

* set logging to debug for vllm

* set logging to debug for vllm_utils3.py

* Fixed timeout bug

* an attempted fix

* Add async engine implementation for vLLM

- Replace synchronous LLMEngine with AsyncLLMEngine
- Add async request processing with asyncio tasks
- Implement non-blocking weight updates with inflight_updates flag
- Add proper async handling for tool calls
- Update debug scripts with timeouts

* Attempt at fixing

* Add detailed logging to track vLLM generation hang

* fixed error

* Fixed wait stalling

* fix issues

* Add comprehensive logging to debug queue hang issue

- Added detailed logging to track queue operations in vllm_utils3.py
- Added logging for data flow in grpo_fast.py
- Track when results are put/get from Ray queues
- Log queue objects to identify routing issues
- Add iteration counters and request details

This will help identify where the data flow is getting stuck between the vLLM generation and data preparation threads.

* Add detailed logging to trace request flow through queues

- Log every step from queue insertion to engine processing
- Track queue object references to identify routing issues
- Log prefetch task startup and iterations
- Add logging to _add_request and _process_request methods
- Log when requests are successfully retrieved from queues
- Track active tasks and batch sizes

This will help identify where requests are getting stuck between submission and processing.

* Fix process_from_queue hanging issue

The bug was that the prefetch task was interfering with asyncio.wait. When
asyncio.wait included the prefetch task with return_when=FIRST_COMPLETED, it
would return immediately when the prefetch task completed (which happens quickly
when the batch is full), without waiting for actual generation tasks.

This caused process_from_queue to return 0 almost immediately thousands of times,
even though there were active generation tasks running.

The fix:
- Only wait for actual generation tasks in self.active_tasks
- Don't include the prefetch task in the wait list
- Remove the special handling for prefetch task in the done set
- Properly wait for the timeout period when there are active tasks

This ensures process_from_queue actually waits for generation to complete
instead of returning immediately.
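
A small reproduction of the waiting logic after the fix: only the generation tasks are passed to `asyncio.wait`, so a quickly finishing prefetch task can no longer cause an immediate return (the task names and delays are invented):

```python
import asyncio


async def wait_for_generation(active_tasks: dict[str, asyncio.Task], timeout: float) -> set:
    # Wait on generation tasks only; the prefetch task is deliberately excluded.
    if not active_tasks:
        return set()
    done, _pending = await asyncio.wait(
        active_tasks.values(),
        timeout=timeout,
        return_when=asyncio.FIRST_COMPLETED,
    )
    return done


async def main() -> None:
    async def fake_generation(delay: float) -> str:
        await asyncio.sleep(delay)
        return f"finished after {delay}s"

    tasks = {
        "fast": asyncio.create_task(fake_generation(0.1)),
        "slow": asyncio.create_task(fake_generation(5.0)),
    }
    done = await wait_for_generation(tasks, timeout=1.0)
    print([t.result() for t in done])      # only the fast task has completed
    for task in tasks.values():
        task.cancel()
    await asyncio.gather(*tasks.values(), return_exceptions=True)


if __name__ == "__main__":
    asyncio.run(main())
```
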

* Add assertion to detect race condition in request ID reuse

* Fix premature exit in vllm_utils3 when using tools

The _should_exit method was incorrectly exiting when active_tasks briefly
went to 0 during tool processing. This caused the vLLM actor to exit early,
leaving requests unprocessed and causing the main training loop to hang.

Changes:
- Check for incomplete requests in request_outputs before deciding to exit
- Add a grace period when no tasks are active to allow for gaps in tool processing
- Prevent premature exit during multi-step tool completions

This fixes the hang issue when running tool_grpo_fast.sh while maintaining
compatibility with non-tool scripts like single_gpu_on_beaker.sh.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Removed metadata from logs

* Add detailed logging to _process_request to debug tool use hang

Added comprehensive logging to track the exact flow through _process_request:
- Log entry/exit points with request IDs
- Track loop iterations and await statements
- Log tool triggering and execution
- Track when requests cannot continue
- Log lock acquisition and output collection

This will help identify where the hang occurs during tool processing.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Add detailed logging around tokenizer access to debug hang

Added logging to identify exactly where the hang occurs when processing
tool outputs. Logs track:
- Before/after accessing self.llm_engine.engine.tokenizer
- Before/after encoding tool output text
- The actual text being encoded and number of tokens

This will help determine if the hang is during tokenizer access or encoding.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Updated endpoint

* Add detailed logging between lines 653-682 to identify exact hang location

- Add logging after creating prompt_and_tool_output_token
- Add logging after calculating excess
- Add logging for can_continue state changes
- Add logging after extending token_ids and masks
- Add logging after calculating new_sample_tokens

This will help pinpoint exactly which line causes the hang when processing
empty tool outputs.

* Add detailed logging to debug token_ids access hang

* Fix tuple/list concatenation issue in vllm_utils3.py

The issue was that request_output.outputs[0].token_ids is returned as a tuple
from VLLM, but the code was trying to:
1. Extend it with .extend() (which only works on mutable lists)
2. Concatenate it directly with lists using +

Fixed by converting token_ids to a list when first storing it in request_output.
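
A quick demonstration of the failure mode and the fix, independent of vLLM:

```python
# token_ids comes back as a tuple, which cannot be extended or concatenated with a list.
prompt_token_ids = (101, 2009, 2003)   # tuple, as returned by the engine
new_tokens = [1037, 3231]

try:
    prompt_token_ids + new_tokens       # tuple + list raises TypeError
except TypeError as exc:
    print(f"concatenation fails: {exc}")

token_ids = list(prompt_token_ids)      # the fix: convert once when first storing it
token_ids.extend(new_tokens)            # now .extend() and list + list both work
print(token_ids)
```
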

* Fix tuple concatenation issue in tool execution path

Convert output.prompt_token_ids from tuple to list before concatenation
to prevent hanging when processing tool outputs. VLLM returns tuples
which cannot be concatenated with lists using the + operator.

* Add detailed debugging to track types during token concatenation

Added logging to understand what types are being concatenated and where
the hang occurs in the tool execution path.

* Fix attribute access for max_model_len in tool execution path

The AsyncLLMEngine wrapper doesn't have model_config directly.
Need to access it through self.llm_engine.engine.model_config instead.
This was causing a hang when trying to access the attribute.

* updated to use TimeoutError

* fixed syntax error

* More logging

* Updated code

* Fixed loop

* Fixed logging

* Attempted to mirror the synchronous loop.

* hold the lock less.

* less frequent logging

* Updated behaviour to match

* Fixed max exceeded calls

* Updated tool path behaviour

* Fix request ID collision in async engine for tool continuations

Use unique iteration-specific request IDs (e.g. train_4_20273_3_iter2) to avoid
vLLM engine conflicts when handling multiple iterations with tools. The base
request_id is still used as cache_salt for optimal prefix caching.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
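
The ID scheme could be sketched as below; the exact format is only inferred from the example `train_4_20273_3_iter2` in the commit message:

```python
def iteration_request_id(base_request_id: str, iteration: int) -> str:
    # The base ID stays stable so it can still serve as the prefix-cache salt,
    # while the suffix keeps every engine submission unique across tool loops.
    return f"{base_request_id}_iter{iteration}"


if __name__ == "__main__":
    base = "train_4_20273_3"
    print([iteration_request_id(base, i) for i in range(3)])
    # ['train_4_20273_3_iter0', 'train_4_20273_3_iter1', 'train_4_20273_3_iter2']
```
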

* Removed the lock

* Cleaned up PR.

* Update async engine (#1043)

* Longer sleep

* Updated mason.py to set default env vars centrally, and set a much longer vllm iteration timeout.

* Removed vllm_enable_V1_multiprocessing.

* Set debug

* Set vllm_enforce_eager to large_test_script.

* Set inflight updates to False

* Added test script

* Fix vLLM async engine hanging by adding polling with timeouts

The vLLM AsyncLLMEngine was hanging because its background task wasn't
getting CPU time to process requests. The async generators were stuck
waiting indefinitely for outputs that never came.

This fix:
- Adds an explicit yield after getting the generator to let the background task start
- Replaces the blocking async for loop with a polling approach using asyncio.wait_for
- Yields control to the event loop on timeouts to allow background tasks to run
- Adds a maximum wait time to prevent indefinite hangs

This ensures the vLLM engine's background task gets regular opportunities
to process requests and yield outputs.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Fix vLLM async generator corruption from timeout

The previous fix using asyncio.wait_for on generator.__anext__() was
corrupting the generator state. After a timeout, the generator would
immediately raise StopAsyncIteration, causing requests to fail after
only 0.1 seconds.

This fix uses a task-based approach:
- Create a background task to consume the generator naturally
- Poll the task for completion while yielding control
- Never timeout on the generator itself to avoid corruption
- Cancel the task cleanly if overall timeout is reached

This ensures the generator runs normally while still yielding control
to the vLLM engine's background processing.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Run AsyncLLMEngine in separate thread to fix event loop issues

The vLLM AsyncLLMEngine wasn't getting proper event loop time in the
Ray actor context, causing generators to hang. This fix:

1. Runs the AsyncLLMEngine in a dedicated thread with its own event loop
2. Converts Ray-called methods to synchronous with thread-safe bridges
3. Uses asyncio.run_coroutine_threadsafe to submit work to engine thread

Key changes:
- Added _run_engine_loop() to create dedicated thread with event loop
- Made process_from_queue() synchronous, delegating to _async version
- Created sync wrappers for init_process_group, update_weight, ready, etc.
- Engine thread runs continuously, ensuring background tasks get CPU time

This ensures the vLLM engine's background task runs properly without
event loop starvation, fixing the hanging issue after step 1.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Revert "Run AsyncLLMEngine in separate thread to fix event loop issues"

This reverts commit 127c361.

* Simplify generate_one_completion by removing polling wrapper

The polling wrapper with asyncio.create_task was adding unnecessary
complexity. Going back to the simple async for loop to let vLLM's
generator work naturally.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>

* Isolate async code to prefetch threads using queue-based approach

- Added thread-safe completion queue for processed results
- Modified _process_request to check completion and push to queue
- Converted process_from_queue to synchronous method
- Removed obsolete _check_and_process_completed_requests method
- Run async components in dedicated thread with own event loop
- Use standard queue operations for synchronous result pulling

This approach avoids event loop starvation issues by ensuring
vLLM's AsyncLLMEngine runs in its own dedicated thread.

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>

* Fix AsyncLLMEngine initialization in dedicated thread

- Create AsyncLLMEngine inside the thread's event loop in __init__
- Remove lazy initialization methods (_ensure_engine_initialized, etc)
- Ensure AsyncLLMEngine is created in the same event loop where it's used
- Use threading.Event to wait for initialization completion
- Keep thread-safe queue for result communication

This fixes the hanging issue by ensuring vLLM's AsyncLLMEngine
is created and used in the same event loop context.

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>
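
The dedicated-thread pattern described above, sketched without vLLM; `FakeEngine` stands in for `AsyncLLMEngine`, and the point is that the engine object is created and used on the same event loop, which lives in its own thread:

```python
import asyncio
import threading


class FakeEngine:
    async def generate(self, prompt: str) -> str:
        await asyncio.sleep(0.1)
        return prompt.upper()


class EngineThread:
    def __init__(self) -> None:
        self.loop = None
        self.engine = None
        self._ready = threading.Event()
        threading.Thread(target=self._run, daemon=True).start()
        if not self._ready.wait(timeout=10):          # wait for initialization to complete
            raise RuntimeError("engine thread failed to initialize")

    def _run(self) -> None:
        self.loop = asyncio.new_event_loop()
        asyncio.set_event_loop(self.loop)

        async def setup() -> None:
            self.engine = FakeEngine()                # created inside the loop it will run on
            self._ready.set()

        self.loop.run_until_complete(setup())
        self.loop.run_forever()

    def generate(self, prompt: str) -> str:
        # Called from other threads: hand the coroutine to the engine's loop.
        future = asyncio.run_coroutine_threadsafe(self.engine.generate(prompt), self.loop)
        return future.result(timeout=10)


if __name__ == "__main__":
    worker = EngineThread()
    print(worker.generate("hello"))                   # HELLO
```
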

* Fix AsyncLLMEngine initialization timeout for tool scripts

- Increase initialization timeout from 30s to 120s for tool scripts
- Add detailed logging for each initialization step
- Move prefetch task creation after init complete signal
- Add exc_info=True to error logging for better debugging

This fixes the timeout issue that was occurring with tool scripts
which need more time to initialize due to additional resources.

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>

* Fix AsyncLLMEngine double event loop conflict

The hanging issue was caused by creating two separate event loops:
- Our thread created its own event loop with run_forever()
- AsyncLLMEngine with start_engine_loop=True created another background thread/loop

This caused a conflict where async operations couldn't execute properly.

Fix: Set start_engine_loop=False so AsyncLLMEngine uses our managed event loop
instead of creating its own. This eliminates the conflict and allows proper
async execution.

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>

* Fix AsyncLLMEngine creation to happen inside async context

The issue was that AsyncLLMEngine with start_engine_loop=False needs
to be created inside an async context, not synchronously.

Fix:
- Create an async setup() function inside the thread
- Use run_until_complete(setup()) to create engine in async context
- Then run_forever() to keep the loop running

This ensures AsyncLLMEngine is properly initialized in the correct
async context before we start using it.

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>

* No more async def methods.

* Fixed loop.

* Cleaned up process_from_queue

* Fixed engine init

* Fixed background loop init.

* Fixed bug

* Fixes to code

* set inflight updates to false

* Add assert_threaded_actor to check Ray actor type

* Fix: Use threaded Ray actors instead of async actors

The issue was that Ray actors default to async actors with uvloop event loop,
which conflicts with our custom event loop setup in AsyncLLMEngine. Single GPU
mode worked because it had different timing, but multi-node consistently failed.

Solution: Use max_concurrency=1000 option to create threaded actors that don't
have event loops, avoiding the conflict entirely.
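
A minimal sketch of the threaded-actor setup (requires Ray; the `max_concurrency` value follows the commit text, and the health-check method is hypothetical). Any method defined with `async def` would instead turn this into an asyncio actor, which is exactly what the fix avoids:

```python
import asyncio

import ray


@ray.remote
class ThreadedActor:
    # No `async def` methods anywhere, so Ray runs this as a threaded actor
    # with no actor event loop to conflict with AsyncLLMEngine's own loop.
    def assert_threaded_actor(self) -> bool:
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            return True      # no running event loop: plain worker thread, as intended
        raise AssertionError("actor is running inside an asyncio event loop")


if __name__ == "__main__":
    ray.init()
    actor = ThreadedActor.options(max_concurrency=1000).remote()
    assert ray.get(actor.assert_threaded_actor.remote())
    ray.shutdown()
```
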

* Remove async method to force threaded actor

Ray creates an AsyncActor if ANY async def method exists in the class.
Removed unused _q_get_async method to ensure LLMRayActor is a threaded actor,
avoiding event loop conflicts with AsyncLLMEngine.

* Add debug logging to identify async methods

* Add more details to event loop assertion error

* Update code

* Fix event loop detection - use get_running_loop instead of get_event_loop

The issue was that asyncio.get_event_loop() in Python 3.10+ creates a loop if
none exists, so it always succeeded. We need to check for a RUNNING loop with
get_running_loop() which only succeeds in async actors.

* Updated timeout

* Set NCCL_DEBUG to warn.

* Fewer logs.

* more logging

* Added memray dep

* Moved logging

* fixed logging

* Updated logging.

* Fix CUDA OOM: Initialize AsyncLLMEngine in main thread

The OOM error was caused by AsyncLLMEngine being initialized in a separate
thread, creating CUDA context isolation. When update_weight() tried to allocate
memory from the main thread, it couldn't access the memory pool from the async
thread.

Fix: Initialize AsyncLLMEngine in the main thread to ensure all CUDA operations
happen in the same context. Only the event loop runs in a separate thread now.

* Fix CUDA OOM: Run collective_rpc operations in event loop thread

The OOM error was caused by CUDA context isolation between threads.
When AsyncLLMEngine is created in a separate thread, all CUDA operations
including model weights and KV cache allocation happen in that thread's
CUDA context.

When update_weight() was called from the main thread (via Ray RPC), it
tried to allocate memory in a different CUDA context, causing apparent
OOM even though memory was available in the other thread's context.

Fix: Use asyncio.run_coroutine_threadsafe() to execute all collective_rpc
operations (init_process_group, update_weight, update_weight_cuda_ipc) in
the event loop thread where the engine and CUDA context were created.

This ensures all CUDA operations happen in the same thread/context while
maintaining the async functionality of AsyncLLMEngine.
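
A sketch of the thread-affinity idea above, with a stand-in for `collective_rpc` that simply reports which thread ran it; the real code would submit the engine's actual RPC to the engine's event-loop thread in the same way:

```python
import asyncio
import threading


def fake_collective_rpc(method: str) -> str:
    # Stand-in for the engine's collective_rpc; it reports the executing thread.
    return f"{method} executed on {threading.current_thread().name}"


class EngineLoopThread:
    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        threading.Thread(target=self.loop.run_forever, name="engine-loop", daemon=True).start()

    def run_on_engine_thread(self, fn, *args):
        async def call():
            return fn(*args)              # executes inside the engine-loop thread
        return asyncio.run_coroutine_threadsafe(call(), self.loop).result(timeout=10)


if __name__ == "__main__":
    engine_thread = EngineLoopThread()
    print(fake_collective_rpc("update_weight"))                                    # MainThread
    print(engine_thread.run_on_engine_thread(fake_collective_rpc, "update_weight"))  # engine-loop
```
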

* now we use collective_rpc

* Undid changes

* Fix vLLM hanging after weight updates by running collective_rpc in thread pool

The issue was that collective_rpc operations (which are blocking) were being
run in the event loop thread, blocking the AsyncLLMEngine's background loop
from processing queued requests.

Changes:
- Run collective_rpc via loop.run_in_executor() to avoid blocking the event loop
- Apply fix to init_process_group, update_weight, and update_weight_cuda_ipc
- Use functools.partial to properly pass keyword arguments to collective_rpc

This allows the async engine to continue processing requests during weight updates.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Fix CUDA context: Use call_soon_threadsafe instead of run_in_executor

Previous fix (b271bd0) broke CUDA context by running collective_rpc in
a ThreadPoolExecutor worker thread instead of the event loop thread.

Root cause: collective_rpc MUST run on the same thread as AsyncLLMEngine
(the event loop thread) to maintain CUDA context. Using run_in_executor
moved it to a different thread, breaking CUDA operations and causing vLLM
engines to hang after ~46 completions.

Solution: Use loop.call_soon_threadsafe() to schedule collective_rpc as
a synchronous callback on the event loop thread. This:
- Maintains correct CUDA context (runs in event loop thread)
- Blocks the event loop during weight updates (necessary for CUDA)
- Properly handles exceptions and return values via threading.Event

Changes:
- Replace run_coroutine_threadsafe + run_in_executor pattern
- Use call_soon_threadsafe with threading.Event for synchronization
- Apply to init_process_group, update_weight, and update_weight_cuda_ipc

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Move threading import to top of file

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Add diagnostic logging for AsyncLLMEngine background loop

Add comprehensive logging to diagnose why the AsyncLLMEngine background
loop stops working after weight updates:

1. update_weight(): Log when collective_rpc starts/ends, log background
   loop task state before/after the blocking call
2. generate_one_completion(): Log every 10 seconds if stuck waiting for
   outputs, log error if generator exits without producing output
3. _prefetch_requests(): Periodically check and log background loop task
   health (every 5 seconds), detect if task is done/cancelled/failed

This will help identify:
- Is the background loop task still alive after blocking?
- Does it become done/cancelled during the collective_rpc?
- Are generators stuck waiting or exiting early?
- What's the exact state transition during weight sync?

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Fix attribute name for background loop task diagnostic

Fix typo: use _background_loop_unshielded instead of
_background_loop_unshielded_task to match vLLM v0.9.1's actual
attribute name.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Revert to simple collective_rpc implementation

Revert all async/threading complexity for collective_rpc operations.
The root cause was that blocking the event loop thread with call_soon_threadsafe
creates orphaned async tasks - the AsyncLLMEngine background loop gets stuck
at await points and never recovers.

Changes:
- Remove call_soon_threadsafe + threading.Event approach
- Remove diagnostic logging
- Revert to simple: directly call collective_rpc from Ray RPC handler thread
- This was the original implementation before commit ed66c24

This may cause CUDA OOM if context isolation is still an issue, but we'll
test if that actually occurs consistently or if it was a flaky/different problem.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Set max concurrency = 1

* longer timeout

* Add AsyncLLMEngine background loop health check

Check if the AsyncLLMEngine background loop is running at the start of
process_from_queue() and restart it if needed. Also assert that the
background loop task is not done/cancelled to catch issues early.

This fixes a potential issue where the background loop stops after weight
sync, causing inference to hang indefinitely.
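
Roughly, the health check described here, assuming vLLM v0's AsyncLLMEngine (`is_running`, `start_background_loop()`, and the `_background_loop_unshielded` task named in an earlier commit):

```python
def _ensure_background_loop(self):
    if not self.llm_engine.is_running:
        self.llm_engine.start_background_loop()
    task = self.llm_engine._background_loop_unshielded
    assert task is not None and not task.done(), (
        f"AsyncLLMEngine background loop is dead (cancelled={task.cancelled()})"
    )
```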

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Added check that tasks aren't dying.

* Update max concurrency

* Added nccl debug

* added codex todo

* Fixed error

---------

Co-authored-by: Claude <[email protected]>

* Removed debugging code

* Ran linter

* Fixed issue.

* Minimized differences between new code and old.

* Fixed cluster warning in large_test_script.sh.

* Cleaned up PR.

* Updated assert threaded actor class.

* Fixed class.

* Set default values for large_test_script.sh

* set enforce eager

* now, we set infligth updates false

* Now, we don't set enforce eager.

* Updated large_test_script.sh

* Fixed env var issue

* Now we set inflight updates true

* trying to start/stop background loop

* Removed start of loop

* now, we use sleep/wake_up to make things work.

* Set inflight true on single

* Removed sleep/wakeup code

* Updated code

* Fixed typo

* Ran linter

* Fixed bug

* switched to use the v1 engine.

* updated code

* Fixed issue where we were calling v0 APIs.

* Fixed hanging issue

* Updated code to remove pause_generation calls.

* updated code

* Fixed abort issue

* updated code

* Add diagnostic logging and fix vLLM v1 compatibility

- Switch from add_request() to generate() to properly start output handler
- Remove break statement to prevent spurious request aborts
- Add exception handling and diagnostic logging to debug hanging issues
- Fix collective_rpc calls to use async versions with run_coroutine_threadsafe

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Set vllm logs to debug

* Updated vllm version to 10.2.

* Updated flash attn version

* Ran uv sync

* Fix AsyncLLMEngine hanging by creating it within running event loop

The AsyncLLMEngine was being created before the event loop started running,
which prevented the output handler from starting automatically. This caused
all generation requests to hang indefinitely.

Fixed by creating the engine from within the running loop context using
run_until_complete(), ensuring AsyncLLM.__init__ detects the running loop
and starts the output handler.
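
A sketch of the fix, assuming vLLM v1's AsyncLLM; `_init_engine_async` mirrors the helper mentioned in the next commit and `engine_args` is a placeholder:

```python
import asyncio

from vllm.v1.engine.async_llm import AsyncLLM


async def _init_engine_async(engine_args):
    # get_running_loop() succeeds here, so AsyncLLM.__init__ can attach its
    # output handler to this loop instead of silently skipping it.
    return AsyncLLM.from_engine_args(engine_args)


# Inside the actor, on the thread that owns self.loop:
# self.llm_engine = self.loop.run_until_complete(_init_engine_async(engine_args))
```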

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Move _init_engine_async to module-level function

The async method was triggering the threaded actor assertion that prevents
async methods on the Ray actor class. Moving it to a free function resolves
this while maintaining the same functionality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Add comprehensive diagnostic logging for async task tracing

- Add logging to verify AsyncLLMEngine output handler initialization
- Track async task lifecycle (entry/exit) in process_request_async()
- Monitor async generator iteration in generate_one_completion()
- Add periodic event loop health monitor (every 30s)
- Log pending tasks to identify hung requests

This will help trace why some requests (train_1_44, train_1_3, etc.) hang
while others complete successfully.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Add diagnostic logging to trace process_from_queue exit behavior

- Log _should_exit() decision-making with stop status and pending work counts
- Track process_from_queue() while loop entry/exit and iteration count
- Log completion_queue operations (get success, empty exceptions)
- This will reveal why the loop exits early leaving items in the queue

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Add diagnostic logging for weight sync stop_requested toggle

Added INFO level logging to trace the weight sync stop/resume cycle:
- Log before/after setting should_stop to True
- Log entry to try block and completion of weight broadcast
- Log entry to finally block before setting should_stop to False
- Log in ActorManager.set_should_stop to confirm state changes

This will help diagnose why stop_requested stays True forever instead of toggling back to False after weight sync completes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Add diagnostic logging to trace weight broadcast deadlock

Added INFO level logging to:
1. update_weight() in vllm_utils3.py:
   - Log entry with param name/dtype/shape
   - Log async task scheduling
   - Log waiting for result
   - Log result completion

2. broadcast_to_vllm() in grpo_fast.py:
   - Log entry with rank
   - Log total parameters to broadcast
   - Log each parameter being processed (name, shape, count)
   - Log update_weight.remote() future creation
   - Log before/after torch.distributed.broadcast
   - Log exit with future count

This will reveal exactly where the deadlock occurs:
- If update_weight() logs don't appear, Ray actor isn't receiving calls
- If "Async task scheduled" appears but no result, event loop isn't processing
- If broadcast doesn't complete, torch.distributed collective is deadlocked

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Add event loop diagnostic logging to update_weight

Added logging to verify which event loops are being used:
- self.loop reference
- self.loop.is_running() status
- asyncio.get_event_loop() reference
- Future object details

This will help identify if there's an event loop mismatch causing the
asyncio.run_coroutine_threadsafe() call to fail to schedule the task.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Fix event loop mismatch in async RPC calls

The deadlock was caused by an event loop mismatch:
- self.loop (running in dedicated thread) vs
- asyncio.get_event_loop() in calling thread (not running)

When collective_rpc_async() was scheduled via run_coroutine_threadsafe,
it ran on the correct loop but internally used the wrong loop context.

Fix: Call asyncio.set_event_loop(self.loop) before run_coroutine_threadsafe
in all methods that use it:
- update_weight()
- update_weight_cuda_ipc()
- init_process_group()
- reset_prefix_cache()

This ensures asyncio.get_event_loop() inside the coroutines returns
the correct running loop.
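
A sketch of one of the updated call sites, with reset_prefix_cache() assumed to be a coroutine on the async engine:

```python
import asyncio


def reset_prefix_cache(self):
    # Associate the engine's loop with this calling thread, so get_event_loop()
    # here no longer hands back a second, never-running loop.
    asyncio.set_event_loop(self.loop)
    future = asyncio.run_coroutine_threadsafe(
        self.llm_engine.reset_prefix_cache(), self.loop
    )
    return future.result()
```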

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Add assertions to verify event loop consistency

Added assertions and logging to catch event loop mismatches:

1. In _init_engine_async():
   - Assert asyncio.get_running_loop() == actor.loop
   - Log both loops to verify engine is created on correct loop

2. In update_weight():
   - Log self.loop and its running status
   - After set_event_loop, assert get_event_loop() == self.loop
   - Catch any loop mismatch before async call

These assertions will fail fast if there's a loop inconsistency,
helping identify exactly where the problem occurs.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Fix async RPC deadlock by using sync-to-async bridge

The deadlock occurred because collective_rpc_async() internally calls
asyncio.get_running_loop().create_future(), which requires execution
within the event loop thread. Using run_coroutine_threadsafe() from
external threads doesn't provide the right execution context.

Solution: Add sync-to-async bridge pattern:
- Add _sync_request_queue for cross-thread communication
- Add _process_sync_requests() async task in event loop
- Add _call_async_from_sync() helper for sync methods
- Update init_process_group(), update_weight(), update_weight_cuda_ipc(),
  and reset_prefix_cache() to use the bridge

This matches PipelineRL's architecture where async work happens in the
event loop thread context, but keeps external API synchronous.
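
A hedged sketch of that bridge; the queue payloads and helper names are illustrative:

```python
import asyncio
import queue


def _call_async_from_sync(self, coro_factory, timeout=600):
    """Called from sync Ray RPC methods (update_weight, init_process_group, ...)."""
    result_q = queue.Queue(maxsize=1)
    self._sync_request_queue.put((coro_factory, result_q))
    ok, value = result_q.get(timeout=timeout)
    if not ok:
        raise value
    return value


async def _process_sync_requests(self):
    """Runs forever as a task on the engine's event loop."""
    loop = asyncio.get_running_loop()
    while True:
        # The blocking queue.get() happens in a worker thread so the loop stays free.
        coro_factory, result_q = await loop.run_in_executor(
            None, self._sync_request_queue.get
        )
        try:
            result_q.put((True, await coro_factory()))
        except Exception as exc:
            result_q.put((False, exc))
```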

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Tried more fixes

* Updated to remove generate thread

* Updated code to add processing

* Change architecture

* Now we set vllm_insecure

* Set inflight false

* removed message serialization

* removed some logs

* Another attempt to fix hang

* lots of logging changes

* Ran linter.

* Reset scripts.

* Undid changes to mason.py

* Cleaned up PR.

* Cleaned up PR.

* Cleaned up PR.

* Cleaned up PR.

* Removed TimeoutError

* Cleaned up PR

* Uses async for

* Now, we handle tools.

* Cleaned assert code.

* Attempt at fixing code.

* Cleaned up assert

* Another attempt at fixing the bug

* Updated code

* Fix tool execution hanging by using dedicated executor instead of asyncio.to_thread

The issue was that asyncio.to_thread uses a default thread pool with limited workers
(min(32, cpu_count + 4)). With 32 concurrent requests (8 prompts x 4 samples), this
could exhaust the thread pool and cause deadlock.

Now using the dedicated ThreadPoolExecutor with 20 workers that's already created
for tool execution, matching the behavior of the main branch.
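
A sketch of the change, with the worker count and executor name illustrative:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Dedicated pool for tool calls instead of asyncio.to_thread's shared default
# executor (min(32, cpu_count + 4) workers), which 32 concurrent requests could exhaust.
tool_executor = ThreadPoolExecutor(max_workers=20, thread_name_prefix="tool")


async def run_tool(tool_fn, *args):
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(tool_executor, tool_fn, *args)
```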

* Fix async event loop issue - use get_running_loop() instead of get_event_loop()

The issue was that process_request_async runs in a specific event loop via
asyncio.run_coroutine_threadsafe, but we were calling get_event_loop() which
could return a different loop or fail in a thread context.

asyncio.to_thread internally uses get_running_loop(), so we must do the same
when using run_in_executor to ensure we use the correct event loop.

* Add logging to check if executor is None during tool execution

* Add detailed logging to track tool execution and triggering

* Fix async event loop hanging by using unique request IDs for each iteration

The issue was that after a tool call, we would loop back and try to generate
again with the same sub_request_id. This caused vLLM to reject or hang on the
duplicate request ID. Now we append '_iterN' to create unique IDs for each
generation attempt within the same request.
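
A tiny sketch of the ID scheme; the `sub_request_id` value is illustrative:

```python
def make_iteration_request_id(sub_request_id: str, iteration: int) -> str:
    """Append '_iterN' so each generation attempt within a request is unique to vLLM."""
    return f"{sub_request_id}_iter{iteration}"


assert make_iteration_request_id("train_1_44", 0) != make_iteration_request_id("train_1_44", 1)
```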

* Add detailed logging to trace async generation hang

* Add detailed logging after tool execution to trace iteration hang

* Add fine-grained logging to debug model config access hang

* Add detailed logging to debug prompt concatenation hang

* Cache prompt_token_ids to avoid hang when accessing TokensPrompt property

* Fix undefined variable in assert_threaded_actor

* Updated code

* Set inflight false

* Fixed duplicate flag

* Simplified significantly

* Removed logs

* Simplified threading model

* Added handling for inflight_updates

* Inlined generate_one_completion

* Clean up

* More clean up

* Set inflight to true

* Cleaned up code.

* lots of cleanup

* Major refactor

* More PR cleanup

* Fixed code

* Updated script

* Cleaned up PR

---------

Co-authored-by: Claude <[email protected]>
Co-authored-by: Saurabh Shah <[email protected]>
Co-authored-by: Hamish Ivison <[email protected]>
Co-authored-by: Teng Xiao <[email protected]>
Co-authored-by: FaezeBr <[email protected]>