[WIP][data] feat: TransferQueue - An asynchronous streaming data management system #3649
Draft
0oshowero0 wants to merge 31 commits into volcengine:main from TransferQueue:main_tq_submodule
* Support controller in TransferQueue * Fix import * Fix comments --------- Co-authored-by: liuximeng <[email protected]>
Added copyright and licensing information to the controller.py file.
Signed-off-by: 0oshowero0 <[email protected]>
* update client docstring Signed-off-by: 0oshowero0 <[email protected]> * fix n_sample related problems Signed-off-by: 0oshowero0 <[email protected]> --------- Signed-off-by: 0oshowero0 <[email protected]>
* Add metadata.py and test_simple_storage_unit.py * Add copyright and license information to test_simple_storage_unit.py * Apply suggestion from @Copilot Co-authored-by: Copilot <[email protected]> --------- Co-authored-by: Han Zhenyu 韩振宇 <[email protected]> Co-authored-by: Copilot <[email protected]>
Co-authored-by: liuximeng <[email protected]>
* Origin recipe * Integrate TransferQueue with Ray Trainer * Fix codecheck * Fix codecheck * Fix codecheck * Fix codecheck * Fix * Fix codecheck --------- Co-authored-by: liuximeng <[email protected]>
* fix chinese comments & add TODO * provide general DataProto<->BatchMeta decorator Signed-off-by: 0oshowero0 <[email protected]> * fix Signed-off-by: 0oshowero0 <[email protected]> * fix Signed-off-by: 0oshowero0 <[email protected]> * fix Signed-off-by: 0oshowero0 <[email protected]> * optimize code Signed-off-by: 0oshowero0 <[email protected]> * fix Signed-off-by: 0oshowero0 <[email protected]> * fix Signed-off-by: 0oshowero0 <[email protected]> --------- Signed-off-by: 0oshowero0 <[email protected]>
Current Progress:
* feat: Support conversion between dataproto and batchmeta * update
Signed-off-by: 0oshowero0 <[email protected]>
What does this PR do?
This PR introduces the TransferQueue data management module to verl, aiming to accelerate experience data transfer and address performance bottlenecks in post-training systems. Detailed design rationale is available in our RFC (#2662).
This PR adds TransferQueue as a git submodule under `verl/experimental/transfer_queue`. In addition, we provide end-to-end scripts that integrate verl with TransferQueue.
TransferQueue is a high-performance data storage and transfer module with panoramic data visibility and streaming scheduling capabilities, optimized for efficient dataflow in post-training workflows (in progress).
The system will introduce the following core components:
- TransferQueueClient: Deployed on each `Worker`; manages communication with the TransferQueue system via simple put/get semantics.
- TransferQueueController: Centralized dataflow scheduler that tracks the production and consumption status of training samples.
- TransferQueueStorage: Distributed storage units that hold the actual experience data.
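To make the division of labour between the three components concrete, here is a minimal in-process sketch. It is not the TransferQueue implementation or its API: the `Toy*` classes, their method names, the per-task consumption bookkeeping, and the round-robin placement of samples on storage units are all assumptions chosen for illustration.

```python
class ToyStorageUnit:
    """Holds the actual field values for the samples placed on it."""

    def __init__(self):
        self._rows = {}  # sample index -> {field name: value}

    def write(self, index, fields):
        self._rows.setdefault(index, {}).update(fields)

    def read(self, index, field_names):
        return {name: self._rows[index][name] for name in field_names}


class ToyController:
    """Tracks which fields exist per sample and which task consumed which sample."""

    def __init__(self):
        self._produced = {}  # sample index -> set of field names
        self._consumed = {}  # task name -> set of sample indices

    def mark_produced(self, index, field_names):
        self._produced.setdefault(index, set()).update(field_names)

    def claim_ready(self, task, field_names, batch_size):
        """Return up to batch_size unconsumed indices whose fields are all present."""
        seen = self._consumed.setdefault(task, set())
        ready = [
            i for i, have in sorted(self._produced.items())
            if i not in seen and set(field_names) <= have
        ]
        batch = ready[:batch_size]
        seen.update(batch)
        return batch


class ToyClient:
    """Worker-side handle exposing put/get (hypothetical signatures)."""

    def __init__(self, controller, storage_units):
        self._controller = controller
        self._units = storage_units

    def _unit_for(self, index):
        # Assumption: samples are placed round-robin by index.
        return self._units[index % len(self._units)]

    def put(self, index, fields):
        # Store the payload first, then report it to the controller.
        self._unit_for(index).write(index, fields)
        self._controller.mark_produced(index, fields.keys())

    def get(self, task, field_names, batch_size):
        indices = self._controller.claim_ready(task, field_names, batch_size)
        return {i: self._unit_for(i).read(i, field_names) for i in indices}


controller = ToyController()
client = ToyClient(controller, [ToyStorageUnit(), ToyStorageUnit()])

client.put(0, {"prompt": "a"})
client.put(1, {"prompt": "b"})
client.put(0, {"response": "x"})  # only sample 0 is complete so far

first = client.get("compute_reward", ["prompt", "response"], batch_size=2)
second = client.get("compute_reward", ["prompt", "response"], batch_size=2)

client.put(1, {"response": "y"})  # sample 1 becomes ready later
third = client.get("compute_reward", ["prompt", "response"], batch_size=2)
```

In this toy, a consumer only ever receives samples whose requested fields have all been produced, and each task sees a sample at most once, which is the kind of production/consumption tracking the controller is described as providing.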
The primary motivation for integrating TransferQueue into verl now is to alleviate the data transfer bottleneck of the single-controller `RayPPOTrainer`. Currently, all `DataProto` objects must be routed through `RayPPOTrainer`, which makes it a single-point bottleneck for the whole post-training system.
Leveraging TransferQueue, we separate experience data transfer from metadata dispatch by replacing `DataProto` with `BatchMeta` (metadata) and `TensorDict` (actual data) structures.
For the `WorkerGroup` class, we hide the above translation process behind a decorator. For `AgentLoop`-related classes, we explicitly do the adaptation in `AgentLoopBase`.
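The decorator idea can be sketched as follows. This is an illustration under stated assumptions, not the decorator shipped in this PR: `ToyBatchMeta`, `via_store`, and the plain-dict `store` (standing in for the distributed storage and for `TensorDict`) are hypothetical names. The point is only that the caller passes and receives lightweight metadata, while the wrapped function keeps working on real data.

```python
import functools
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyBatchMeta:
    """Lightweight descriptor: which samples and which fields, no payload."""
    indices: tuple
    fields: tuple


def via_store(store):
    """Decorator factory: resolve metadata to data, run fn, store outputs."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(meta):
            # Fetch the actual data described by the metadata (column-oriented).
            batch = {f: [store[i][f] for i in meta.indices] for f in meta.fields}
            outputs = fn(batch)
            # Write each new column back, one value per sample.
            for name, column in outputs.items():
                for i, value in zip(meta.indices, column):
                    store[i][name] = value
            new_fields = meta.fields + tuple(
                n for n in outputs if n not in meta.fields
            )
            # Only metadata travels back to the caller.
            return ToyBatchMeta(meta.indices, new_fields)
        return wrapper
    return decorate


# Hypothetical stand-in for the storage units: sample index -> fields.
store = {0: {"response": "hi"}, 1: {"response": "hello"}}


@via_store(store)
def compute_length(batch):
    # The worker function itself is unaware of metadata or storage.
    return {"length": [len(r) for r in batch["response"]]}


meta_out = compute_length(ToyBatchMeta(indices=(0, 1), fields=("response",)))
```

Here the single controller would only handle `ToyBatchMeta` objects, so the payload never passes through it.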
Checklist Before Starting
- PR title format: `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - Multiple modules are separated with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - For a breaking change, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`
Test
We've validated TransferQueue functionality through
API and Usage Example
The primary interaction points are `AsyncTransferQueueClient` and `TransferQueueClient`, which serve as the communication interface with the TransferQueue system.
Core client interfaces:
You may refer to the example here, where we mimic the verl usage in both async and sync scenarios:
https://github.com/TransferQueue/TransferQueue/tree/dev/recipe/simple_use_case.
For verl integration, we put TransferQueue as a submodule in `verl/experimental/transfer_queue`. So please run the following git command first:
Then you can try our recipe (still under development).
Design & Code Changes
Refer to our Paper, RFC, and Zhihu post :)
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
`ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)