
Conversation

@finbarrtimbers (Collaborator) commented Sep 16, 2025

By moving to the async engine, we get a much cleaner central processing loop: we process each request in a dedicated function and rely on vLLM to handle the batching.

SGLang and TensorRT-LLM both use async engines, so moving to vLLM's async API will make it much easier to test out those engines.
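
For context, the per-request shape the async engine enables looks roughly like this; a minimal sketch assuming vLLM's `AsyncLLMEngine` async-generator API, with illustrative names and model rather than the actual code in this PR:

```python
# Minimal sketch of the per-request pattern (illustrative, not the PR's code).
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def process_one_request(engine: AsyncLLMEngine, prompt: str, request_id: str):
    """Handle one request end-to-end; vLLM batches concurrent requests internally."""
    params = SamplingParams(max_tokens=64)
    final_output = None
    async for output in engine.generate(prompt, params, request_id):
        final_output = output  # keep the latest (eventually finished) RequestOutput
    return final_output


async def main() -> None:
    engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="facebook/opt-125m"))
    prompts = ["Hello", "The capital of France is"]
    results = await asyncio.gather(
        *(process_one_request(engine, p, f"req_{i}") for i, p in enumerate(prompts))
    )
    for r in results:
        print(r.outputs[0].text)


if __name__ == "__main__":
    asyncio.run(main())
```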

No significant speed change. There are some minor performance differences, but it's unclear whether any of them are substantial: Wandb. Note that time/weight_sync has gone up.

Runs with inflight_updates=True:

  1. Single GPU debug run: Beaker.
  2. Single GPU debug run w/ tool use: Beaker.
  3. Multi-node test script: Beaker.

And with inflight_updates=False:

  1. Single GPU: Beaker
  2. Multi-node: Beaker

finbarrtimbers and others added 30 commits September 12, 2025 14:14
- Use dataclasses.replace to properly set CompletionOutput.index when tools disabled
- Make request_tracking keys unique by combining training_step and dataset_index
- Pass training_step through to GenerationResult to enable proper tracking key construction

This fixes the hang that occurred when tools were disabled due to:
1. Incorrect index values preventing output matching
2. Duplicate dataset indices across training steps causing prefetch thread crash
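
A hedged sketch of the two mechanisms described above; `fix_output_index` and `tracking_key` are illustrative names, not necessarily what the code calls them:

```python
import dataclasses

import vllm


def fix_output_index(output: vllm.CompletionOutput, sub_request_id: str) -> vllm.CompletionOutput:
    # CompletionOutput is a dataclass, so dataclasses.replace builds a copy
    # with the corrected index instead of mutating the original.
    return dataclasses.replace(output, index=int(sub_request_id.split("_")[-1]))


def tracking_key(training_step: int, dataset_index: int) -> str:
    # Combining training_step and dataset_index keeps keys unique even when
    # the same dataset index reappears in a later training step.
    return f"{training_step}_{dataset_index}"
```
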
The recent commit changed the tracking system to use combined keys
(training_step_dataset_index), but there was a bug where we tried to
access request_metadata[request_id]["training_step"] after calling
_cleanup_request_data, which removes the metadata.

This fix gets the training_step value before calling _cleanup_request_data.
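
The ordering is the whole fix; a tiny self-contained sketch (the helper name and signature are illustrative):

```python
from typing import Callable


def pop_training_step(request_metadata: dict, request_id: str, cleanup: Callable[[str], None]) -> int:
    # Read the value *before* cleanup deletes the metadata entry;
    # reversing these two lines reintroduces the KeyError.
    training_step = request_metadata[request_id]["training_step"]
    cleanup(request_id)
    return training_step
```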

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
This assertion will help debug the hanging issue in single_gpu_on_beaker.sh
by ensuring that all n sub-requests are properly added to vllm_active_requests.

If the assertion fails, it means some sub-requests are not being tracked
properly after being added to vLLM.
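
A sketch of the kind of check this adds, assuming sub-request IDs are the base ID suffixed with `_0` … `_{n-1}` (illustrative, not the exact assertion):

```python
def assert_subrequests_tracked(base_request_id: str, n: int, vllm_active_requests: set) -> None:
    # All n sub-requests should appear in vllm_active_requests right after being
    # handed to vLLM; any missing entry points at the source of the hang.
    missing = [
        f"{base_request_id}_{i}"
        for i in range(n)
        if f"{base_request_id}_{i}" not in vllm_active_requests
    ]
    assert not missing, f"Sub-requests missing from vllm_active_requests: {missing}"
```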

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Move index field assignment before adding CompletionOutput to request_outputs
to ensure _maybe_process_and_insert can properly identify sub-requests.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
When tools are disabled and n>1, fix the index field immediately when
receiving outputs from vLLM. Each sub-request gets the correct index
extracted from its ID (e.g., train_3_12_1 -> index=1).

Added assertions to verify the fix is working correctly.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
The issue occurred when tools were disabled due to premature removal
of sub-requests from vllm_active_requests. This caused validation
failures where expected sub-requests couldn't be found.

The fix ensures sub-requests are removed from vllm_active_requests
at the correct time - after determining they won't go to tools but
before calling _finalize_sub_request to avoid deadlock in
_maybe_process_and_insert.
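
As a hedged control-flow sketch of that ordering (names mirror the commit message; the real method does more):

```python
def handle_finished_sub_request(
    sub_request_id: str,
    output,
    needs_tools: bool,
    vllm_active_requests: set,
    finalize,
) -> None:
    if needs_tools:
        # Tool-bound sub-requests stay tracked until the tool call round-trips.
        return
    # Remove *after* deciding the output won't go to tools, but *before* finalizing,
    # so _maybe_process_and_insert never waits on a sub-request that already finished.
    vllm_active_requests.discard(sub_request_id)
    finalize(sub_request_id, output)
```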

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
* remove flash infer from dependencies, mason, and benchmark script (#1000)

* update eval names (#1001)

* Puzzle verifier (#1003)

* adding PuzzleMatcherVerifier

* Update ground_truth_utils.py

* Create test_ground_truth_utils.py

* Update ground_truth_utils.py

* Update ground_truth_utils.py

* lint

* lint, address gemini

* Puzzle -> puzzle

* Fixed tests.

* Cleaned up test

* Cleaned up test

---------

Co-authored-by: Teng Xiao <[email protected]>
Co-authored-by: Finbarr Timbers <[email protected]>

* fixes for manual eval jobs with new cluster names (#1002)

* fixes for manual eval jobs with new cluster names

* fix default cluster names

* Disables cascade attention. (#1006)

* Disables cascade attention.

* fix issue with creation of LLMEngine.

* whoops wrong name (#1007)

* Clean up

* Removed request_tracking.

* Removed tracking key call

* Fixed attribute error

* Removed tracking key

* Removed asserts

* Fix KeyError in _poll_tool_futures by restoring missing concat_outputs assignment

The removal of asserts accidentally removed the line that stores output_copy
in tracking['concat_outputs'], causing a KeyError when _poll_tool_futures
tries to access the tracking data for tool processing.
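
A sketch of the restored bookkeeping, with the tracking layout guessed from the commit message (a dict of per-request state keyed by request ID):

```python
def record_output_for_tools(tracking: dict, request_id: str, output_copy) -> None:
    # Without this entry, _poll_tool_futures's lookup of
    # tracking["concat_outputs"][request_id] raises KeyError.
    tracking.setdefault("concat_outputs", {})[request_id] = output_copy
```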

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>

* Fix linter errors: remove unused variables

- Removed unused 'prefix' variable in add_request function
- Removed unused 'request_id' variable in _prefetch_worker
- Removed unused 'training_step' variable in _maybe_process_and_insert
- Removed unused 'context' variable in _poll_tool_futures

🤖 Generated with Claude Code

Co-Authored-By: Claude <[email protected]>

* Undid local changes.

* Fixed accidental deletion

* Removed copy

* Cleaned function signatures.

* Cleaned up finalize_outputs

* Added script. (#1009)

* Changed vllm_active_requests into a set.

* Lots of clean up.

* Cleaned up _maybe_process_and_insert significantly.

* Attempt at fixing empty response issue.

* Fixed errors and testing with fake asserts. (#1010)

* Small refactor to prepare for the `backfill-prompts` PR (#1011)

* Changes have been made.

* Updated function signature of _finalize_outputs.

* Added tests for make_request_id.

* Deleted old function.

* Cleaned up test

* Removed spurious except clause

* Fixed bug where we were passing the wrong sampling params to _finalize_outputs.

* Fixed tests.

* Ran linter.

* attempted to fix bug in completions.

* Removed _finalize_outputs, renamed process_outputs.

* Check we have enough data for bsz/prefill (#1012)

* check that we have enough data

* lint

* comments

* update comment

* minor update

* Cleaned up code significantly.

* Cleaned up tests.

* Combine process_output and _process_completed_request into process_completed_request function.

* Removed duplicate definition of _extract_base_request_id.

* Fixed linter error.

---------

Co-authored-by: Saurabh Shah <[email protected]>
Co-authored-by: Hamish Ivison <[email protected]>
Co-authored-by: Teng Xiao <[email protected]>
Co-authored-by: FaezeBr <[email protected]>
Co-authored-by: Claude <[email protected]>
- Added inflight_updates boolean argument to Args class in grpo_fast.py
- Passed inflight_updates through to vllm_utils3.create_vllm_engines
- Modified LLMRayActor to accept and handle inflight_updates mode
- When inflight_updates is True, process_from_queue returns immediately when should_stop is set
- When inflight_updates is False, maintains existing behavior (waits for pending work)

This allows for quick pausing and resumption of request processing during model updates.
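
The decision the flag controls reduces to something like the predicate below; a hedged sketch with simplified names, not the actor code itself:

```python
def should_return_now(inflight_updates: bool, should_stop: bool, has_pending_work: bool) -> bool:
    # inflight_updates=True: return as soon as a stop is requested, even with work
    # in flight, so the trainer can pause generation and push new weights quickly.
    if inflight_updates and should_stop:
        return True
    # inflight_updates=False: preserve the old behavior and only return once
    # pending work has drained.
    return should_stop and not has_pending_work
```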

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Added _should_exit method to encapsulate all exit logic
- Uses early returns with no nested conditions for clarity
- Optimizes by checking expensive operations only when needed
- Removed duplicate exit condition checks from the main loop
- Simplified while loop to use while not self._should_exit()

This makes the code more maintainable and the exit conditions clearer.
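
Structurally, the refactor looks something like this sketch (the conditions are placeholders for whatever the actor actually checks):

```python
class Worker:
    """Illustrative stand-in for the actor; only the loop shape matters here."""

    def __init__(self) -> None:
        self.stop_requested = False
        self.pending: list[str] = []

    def _should_exit(self) -> bool:
        # Each exit condition gets its own early return; nothing nests, and the
        # cheaper checks run before any expensive ones would.
        if self.stop_requested:
            return True
        if not self.pending:
            return True
        return False

    def run(self) -> None:
        # The main loop now reads as a single predicate instead of duplicated checks.
        while not self._should_exit():
            self.pending.pop()
```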

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@hamishivi (Collaborator) left a comment

Added some comments. Currently tool use is broken on main, and I'd like to fix that before we merge this!

RAY_WAIT_TIMEOUT_S = 0.1
KV_CACHE_RETRIES = 5
KV_CACHE_RETRY_SLEEP_S = 0.2
WEIGHT_UPDATE_SLEEP_INTERVAL_S = 0.1
Collaborator

Could we add short explanations of each constant? Some are obvious (weight_update_timeout), but others are a bit less clear (default_workers, kv_cache_retries)

Collaborator Author

I went over all of these. A bunch of them were redundant and could be removed; others, imo, should be handled by timeouts in grpo_fast.py on the actual ray.get calls, so I removed them. I think the remaining ones are clear from the names. What do you think?
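
For reference, bounding the wait at the call site looks roughly like this; a toy Ray task stands in for the engine actor call, and the timeout value is illustrative:

```python
import ray


@ray.remote
def slow_task() -> int:
    import time

    time.sleep(5)
    return 1


try:
    # A driver-side timeout replaces per-engine retry/sleep constants.
    result = ray.get(slow_task.remote(), timeout=1.0)
except ray.exceptions.GetTimeoutError:
    result = None  # retry, resubmit, or surface the failure here
```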

cursor[bot] left two comments that were marked as outdated.

@hamishivi disabled auto-merge October 17, 2025 04:27
@hamishivi (Collaborator) left a comment

Great to get this in! I added my fixes from the previous PR to fix tool use. Left two minor comments (from testing, everything works fine anyway).

current_sampling_params.max_tokens = new_sample_tokens

complete_output = vllm.CompletionOutput(
index=sub_request_id.split("_")[-1],
Collaborator

+1 on the cursor thing here - i think the typing on the CompletionOutput class says index should be an int?

Collaborator Author

ah, yes. nice.

continue

request = actor.prompt_queue.get()
add_request(actor, request)

Bug: Blocking Queue Call Prevents Proper Shutdown

The _prefetch_worker uses a blocking prompt_queue.get() call. This can cause the worker thread to hang indefinitely when the queue is empty, preventing _should_stop() checks and leading to improper shutdown. The previous implementation included a timeout for graceful termination.
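
A minimal sketch of the timeout-based variant being suggested, with simplified names and the standard-library queue standing in for whatever queue type the actor actually uses:

```python
import queue
from typing import Callable


def prefetch_loop(
    prompt_queue: "queue.Queue",
    should_stop: Callable[[], bool],
    add_request: Callable[[object], None],
) -> None:
    while not should_stop():
        try:
            # A bounded get lets the loop re-check should_stop() even when the queue is empty.
            request = prompt_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        add_request(request)
```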

