
Conversation

@finbarrtimbers (Collaborator) commented Sep 16, 2025

By moving to the async engine, we get a much cleaner central processing loop: we process each request in a dedicated function and rely on vLLM to handle the batching.

SGLang and TensorRT-LLM both use async engines, so moving to vLLM's async API will make it much easier to test out those engines.
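
For context, the per-request shape the async engine enables looks roughly like this; a minimal sketch assuming vLLM's `AsyncLLMEngine` async-generator API, with illustrative names and model rather than the actual code in this PR:

```python
# Minimal sketch of the per-request pattern (illustrative, not the PR's code).
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def process_one_request(engine: AsyncLLMEngine, prompt: str, request_id: str):
    """Handle one request end-to-end; vLLM batches concurrent requests internally."""
    params = SamplingParams(max_tokens=64)
    final_output = None
    async for output in engine.generate(prompt, params, request_id):
        final_output = output  # keep the latest (eventually finished) RequestOutput
    return final_output


async def main() -> None:
    engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))
    prompts = ["Hello", "The capital of France is"]
    results = await asyncio.gather(
        *(process_one_request(engine, p, f"req_{i}") for i, p in enumerate(prompts))
    )
    for r in results:
        print(r.outputs[0].text)


if __name__ == "__main__":
    asyncio.run(main())
```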

No significant speed change. There are some minor performance differences, but it's unclear whether any of them are substantial: Wandb. Note that time/weight_sync has gone up.

Runs with inflight_updates=True:

  1. Single GPU debug run: Beaker.
  2. Single GPU debug run w/ tool use: Beaker.
  3. Multi-node test script: Beaker.

And with inflight_updates=False:

  1. Single GPU: Beaker
  2. Multi-node: Beaker

finbarrtimbers and others added 30 commits September 12, 2025 14:14
- Use dataclasses.replace to properly set CompletionOutput.index when tools disabled
- Make request_tracking keys unique by combining training_step and dataset_index
- Pass training_step through to GenerationResult to enable proper tracking key construction

This fixes the hang that occurred when tools were disabled due to:
1. Incorrect index values preventing output matching
2. Duplicate dataset indices across training steps causing prefetch thread crash
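
A hedged sketch of the two mechanisms described above; `fix_output_index` and `tracking_key` are illustrative names, not necessarily what the code calls them:

```python
import dataclasses

import vllm


def fix_output_index(output: vllm.CompletionOutput, sub_request_id: str) -> vllm.CompletionOutput:
    # CompletionOutput is a dataclass, so dataclasses.replace builds a copy
    # with the corrected index instead of mutating the original.
    return dataclasses.replace(output, index=int(sub_request_id.split("_")[-1]))


def tracking_key(training_step: int, dataset_index: int) -> str:
    # Combining training_step and dataset_index keeps keys unique even when
    # the same dataset index reappears in a later training step.
    return f"{training_step}_{dataset_index}"
```
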
The recent commit changed the tracking system to use combined keys
(training_step_dataset_index), but there was a bug where we tried to
access request_metadata[request_id]["training_step"] after calling
_cleanup_request_data, which removes the metadata.

This fix gets the training_step value before calling _cleanup_request_data.
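
The ordering is the whole fix; a tiny self-contained sketch (the helper name and signature are illustrative):

```python
from typing import Callable


def pop_training_step(request_metadata: dict, request_id: str, cleanup: Callable[[str], None]) -> int:
    # Read the value *before* cleanup deletes the metadata entry;
    # reversing these two lines reintroduces the KeyError.
    training_step = request_metadata[request_id]["training_step"]
    cleanup(request_id)
    return training_step
```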

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
This assertion will help debug the hanging issue in single_gpu_on_beaker.sh
by ensuring that all n sub-requests are properly added to vllm_active_requests.

If the assertion fails, it means some sub-requests are not being tracked
properly after being added to vLLM.
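
A sketch of the kind of check this adds, assuming sub-request IDs are the base ID suffixed with `_0` … `_{n-1}` (illustrative, not the exact assertion):

```python
def assert_subrequests_tracked(base_request_id: str, n: int, vllm_active_requests: set) -> None:
    # All n sub-requests should appear in vllm_active_requests right after being
    # handed to vLLM; any missing entry points at the source of the hang.
    missing = [
        f"{base_request_id}_{i}"
        for i in range(n)
        if f"{base_request_id}_{i}" not in vllm_active_requests
    ]
    assert not missing, f"Sub-requests missing from vllm_active_requests: {missing}"
```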

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Move index field assignment before adding CompletionOutput to request_outputs
to ensure _maybe_process_and_insert can properly identify sub-requests.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
When tools are disabled and n>1, fix the index field immediately when
receiving outputs from vLLM. Each sub-request gets the correct index
extracted from its ID (e.g., train_3_12_1 -> index=1).

Added assertions to verify the fix is working correctly.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
The issue occurred when tools were disabled due to premature removal
of sub-requests from vllm_active_requests. This caused validation
failures where expected sub-requests couldn't be found.

The fix ensures sub-requests are removed from vllm_active_requests
at the correct time - after determining they won't go to tools but
before calling _finalize_sub_request to avoid deadlock in
_maybe_process_and_insert.
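
As a hedged control-flow sketch of that ordering (names mirror the commit message; the real method does more):

```python
def handle_finished_sub_request(
    sub_request_id: str,
    output,
    needs_tools: bool,
    vllm_active_requests: set,
    finalize,
) -> None:
    if needs_tools:
        # Tool-bound sub-requests stay tracked until the tool call round-trips.
        return
    # Remove *after* deciding the output won't go to tools, but *before* finalizing,
    # so _maybe_process_and_insert never waits on a sub-request that already finished.
    vllm_active_requests.discard(sub_request_id)
    finalize(sub_request_id, output)
```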

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
* remove flash infer from dependencies, mason, and benchmark script (#1000)

* update eval names (#1001)

* Puzzle verifier (#1003)

* adding PuzzleMatcherVerifier

* Update ground_truth_utils.py

* Create test_ground_truth_utils.py

* Update ground_truth_utils.py

* Update ground_truth_utils.py

* lint

* lint, address gemini

* Puzzle -> puzzle

* Fixed tests.

* Cleaned up test

* Cleaned up test

---------

Co-authored-by: Teng Xiao <[email protected]>
Co-authored-by: Finbarr Timbers <[email protected]>

* fixes for manual eval jobs with new cluster names (#1002)

* fixes for manual eval jobs with new cluster names

* fix default cluster names

* Disables cascade attention. (#1006)

* Disables cascade attention.

* fix issue with creation of LLMEngine.

* whoops wrong name (#1007)

* Clean up

* Removed request_tracking.

* Removed tracking key call

* Fixed attribute error

* Removed tracking key

* Removed asserts

* Fix KeyError in _poll_tool_futures by restoring missing concat_outputs assignment

The removal of asserts accidentally removed the line that stores output_copy
in tracking['concat_outputs'], causing a KeyError when _poll_tool_futures
tries to access the tracking data for tool processing.
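
A sketch of the restored bookkeeping, with the tracking layout guessed from the commit message (a dict of per-request state keyed by request ID):

```python
def record_output_for_tools(tracking: dict, request_id: str, output_copy) -> None:
    # Without this entry, _poll_tool_futures's lookup of
    # tracking["concat_outputs"][request_id] raises KeyError.
    tracking.setdefault("concat_outputs", {})[request_id] = output_copy
```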

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>

* Fix linter errors: remove unused variables

- Removed unused 'prefix' variable in add_request function
- Removed unused 'request_id' variable in _prefetch_worker
- Removed unused 'training_step' variable in _maybe_process_and_insert
- Removed unused 'context' variable in _poll_tool_futures

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>

* Undid local changes.

* Fixed accidental deletion

* Removed copy

* Cleaned function signatures.

* Cleaned up finalize_outputs

* Added script. (#1009)

* Changed vllm_active_requests into a set.

* Lots of clean up.

* Cleaned up _maybe_process_and_insert significantly.

* Attempt at fixing empty response issue.

* Fixed errors and testing with fake asserts. (#1010)

* Small refactor to prepare for the `backfill-prompts` PR (#1011)

* Changes have been made.

* Updated function signature of _finalize_outputs.

* Added tests for make_request_id.

* Deleted old function.

* Cleaned up test

* Removed spurious except clause

* Fixed bug where we were passing the wrong sampling params to _finalize_outputs.

* Fixed tests.

* Ran linter.

* attempted to fix bug in completions.

* Removed _finalize_outputs, renamed process_outputs.

* Check we have enough data for bsz/prefill (#1012)

* check that we have enough data

* lint

* comments

* update comment

* minor update

* Cleaned up code significantly.

* Cleaned up tests.

* Combine process_output and _process_completed_request into process_completed_request function.

* Removed duplicate definition of _extract_base_request_id.

* Fixed linter error.

---------

Co-authored-by: Saurabh Shah <[email protected]>
Co-authored-by: Hamish Ivison <[email protected]>
Co-authored-by: Teng Xiao <[email protected]>
Co-authored-by: FaezeBr <[email protected]>
Co-authored-by: Claude <[email protected]>
- Added inflight_updates boolean argument to Args class in grpo_fast.py
- Passed inflight_updates through to vllm_utils3.create_vllm_engines
- Modified LLMRayActor to accept and handle inflight_updates mode
- When inflight_updates is True, process_from_queue returns immediately when should_stop is set
- When inflight_updates is False, maintains existing behavior (waits for pending work)

This allows for quick pausing and resumption of request processing during model updates.
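
The decision the flag controls reduces to something like the predicate below; a hedged sketch with simplified names, not the actor code itself:

```python
def should_return_now(inflight_updates: bool, should_stop: bool, has_pending_work: bool) -> bool:
    # inflight_updates=True: return as soon as a stop is requested, even with work
    # in flight, so the trainer can pause generation and push new weights quickly.
    if inflight_updates and should_stop:
        return True
    # inflight_updates=False: preserve the old behavior and only return once
    # pending work has drained.
    return should_stop and not has_pending_work
```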

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Added _should_exit method to encapsulate all exit logic
- Uses early returns with no nested conditions for clarity
- Optimizes by checking expensive operations only when needed
- Removed duplicate exit condition checks from the main loop
- Simplified while loop to use while not self._should_exit()

This makes the code more maintainable and the exit conditions clearer.
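
Structurally, the refactor looks something like this sketch (the conditions are placeholders for whatever the actor actually checks):

```python
class Worker:
    """Illustrative stand-in for the actor; only the loop shape matters here."""

    def __init__(self) -> None:
        self.stop_requested = False
        self.pending: list[str] = []

    def _should_exit(self) -> bool:
        # Each exit condition gets its own early return; nothing nests, and the
        # cheaper checks run before any expensive ones would.
        if self.stop_requested:
            return True
        if not self.pending:
            return True
        return False

    def run(self) -> None:
        # The main loop now reads as a single predicate instead of duplicated checks.
        while not self._should_exit():
            self.pending.pop()
```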

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@hamishivi (Collaborator) left a comment

Added some comments. Currently tool use is broken on main, and I'd like to fix that before we merge this!

RAY_WAIT_TIMEOUT_S = 0.1
KV_CACHE_RETRIES = 5
KV_CACHE_RETRY_SLEEP_S = 0.2
WEIGHT_UPDATE_SLEEP_INTERVAL_S = 0.1
Collaborator

Could we add short explanations of each constant? Some are obvious (weight_update_timeout), but others are a bit less clear (default_workers, kv_cache_retries)

Collaborator Author

I went over all of these. A bunch of them were redundant and could be removed; others, imo, should be handled by timeouts in grpo_fast.py on the actual ray.get calls, so I removed them. I think the remaining ones are clear from the names. What do you think?
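
For reference, bounding the wait at the call site looks roughly like this; a toy Ray task stands in for the engine actor call, and the timeout value is illustrative:

```python
import ray


@ray.remote
def slow_task() -> int:
    import time

    time.sleep(5)
    return 1


try:
    # A driver-side timeout replaces per-engine retry/sleep constants.
    result = ray.get(slow_task.remote(), timeout=1.0)
except ray.exceptions.GetTimeoutError:
    result = None  # retry, resubmit, or surface the failure here
```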

cursor[bot] left two comments that were marked as outdated.

@hamishivi disabled auto-merge October 17, 2025 04:27
@hamishivi (Collaborator) left a comment

Great to get this in! I added my fixes from the previous PR to fix tool use. Left two minor comments (from testing, everything works fine anyway).

current_sampling_params.max_tokens = new_sample_tokens

complete_output = vllm.CompletionOutput(
index=sub_request_id.split("_")[-1],
Collaborator

+1 on the cursor thing here - i think the typing on the CompletionOutput class says index should be an int?

Collaborator Author

ah, yes. nice.

continue

request = actor.prompt_queue.get()
add_request(actor, request)

Bug: Blocking Queue Call Prevents Proper Shutdown

The _prefetch_worker uses a blocking prompt_queue.get() call. This can cause the worker thread to hang indefinitely when the queue is empty, preventing _should_stop() checks and leading to improper shutdown. The previous implementation included a timeout for graceful termination.
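
A minimal sketch of the timeout-based variant being suggested, with simplified names and the standard-library queue standing in for whatever queue type the actor actually uses:

```python
import queue
from typing import Callable


def prefetch_loop(
    prompt_queue: "queue.Queue",
    should_stop: Callable[[], bool],
    add_request: Callable[[object], None],
) -> None:
    while not should_stop():
        try:
            # A bounded get lets the loop re-check should_stop() even when the queue is empty.
            request = prompt_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        add_request(request)
```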

