[RFC] A centralized data management module for fine-grained dataflow management across RL tasks #2662
0oshowero0 started this conversation in RFC
Replies: 4 comments 2 replies
-
A Chinese version (not an identical translation) can be found at Zhihu :)
-
what do you propose as the next step?
-
Hello, I have received your email and will reply to you as soon as possible.
-
We have updated the API and adaptation as follows. Now we are developing a simple demo to showcase the usage of the data system :)
-
A Chinese version (not an identical translation) can be found at Zhihu :)
Motivation
We observe that the recently proposed RL frameworks for LLM post-training mainly adopt a task-separated paradigm. Distinct RL tasks such as actor rollout, reference inference, and actor update are distributed across separate hardware resources.
The core motivation is scaling efficiency. The emergence of reasoning models (e.g., OpenAI o1 and DeepSeek R1) has unveiled post-training scaling laws, resulting in significantly higher computational demands for RL training.
However, in task-collocated frameworks, the binding between actor rollout and actor update significantly limits this scaling, as the two workloads exhibit distinct characteristics. Consequently, achieving equivalent throughput requires allocating different amounts of computational resources to these tasks, as shown in Fig. 2 of the StreamRL study.
The task-separated framework eliminates this constraint by allocating dedicated resources to each RL task, enabling efficient large-scale post-training, but it introduces significant challenges in data management. To reduce pipeline bubbles in task-separated frameworks, we need to execute RL tasks in parallel. Current implementations of DataProto & Dispatch focus on block-level (global batch) data management across worker processes within a single RL task, which makes fine-grained parallel execution at the micro-batch level across different RL tasks difficult.
To address this challenge, we propose a centralized streaming data management module named TransferQueue to route fine-grained data dependencies across RL tasks dynamically.
This data management module decouples the data dependencies across RL tasks, effectively reducing the complexity of framework design. Moreover, it enables automatic pipeline overlapping among RL tasks and provides a general entry point for batch load balancing, resulting in higher throughput.
The detailed motivation and architecture design can be found in this paper: AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training.
Proposed Design
Architecture Overview
As illustrated below, TransferQueue acts as a streaming data scheduler bridging the training and inference tasks, managing the entire dataflow in the RL post-training process. The control plane maintains the fine-grained metadata of each training sample, while the data plane stores the actual data in a distributed manner.
TransferQueue employs a 2D data structure as follows:
This data structure design is motivated by the computational characteristics of the RL training process, where each training sample is generated in a relayed manner across RL task pipelines. It provides an accurate addressing capability, which allows fine-grained, concurrent data read/write operations in a streaming manner.
In the control plane, we track the production and consumption status of each training sample as metadata. When all the required computational tasks are completed (i.e., all column flags are set to ✅ for a given row), we know that this data sample can be consumed by downstream tasks. Note that each RL task requires different inputs, so we deploy a task-specific controller for each RL task (e.g., reference inference only needs the prompt, while actor update requires the prompt, response, old log prob, etc.).
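As a rough illustration of this bookkeeping (not the actual TransferQueue implementation; the class and column names below are hypothetical), the control plane can be thought of as a 2D table of per-sample, per-column production flags, where a row becomes consumable for a task once all of that task's required columns are produced:

```python
# Hypothetical illustration of the control-plane metadata table (not the actual
# TransferQueue implementation; class and column names are illustrative).
from dataclasses import dataclass, field


@dataclass
class SampleMetadata:
    produced: dict = field(default_factory=dict)   # per-column production flags
    consumed_by: set = field(default_factory=set)  # tasks that already consumed this row


class ControlPlane:
    def __init__(self, num_samples, columns):
        # One row per training sample, one flag per data column.
        self.table = [SampleMetadata({c: False for c in columns}) for _ in range(num_samples)]

    def mark_produced(self, row, column):
        self.table[row].produced[column] = True

    def ready_rows(self, task_name, required_columns):
        """Rows whose required columns are all produced and not yet consumed by this task."""
        return [
            i for i, meta in enumerate(self.table)
            if task_name not in meta.consumed_by
            and all(meta.produced[c] for c in required_columns)
        ]


# Reference inference only needs the prompt; actor update needs all three columns.
cp = ControlPlane(num_samples=4, columns=["prompt", "response", "old_log_prob"])
cp.mark_produced(0, "prompt")
print(cp.ready_rows("reference_inference", ["prompt"]))                       # [0]
print(cp.ready_rows("actor_update", ["prompt", "response", "old_log_prob"]))  # []
```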
To illustrate the interaction workflow, we use data retrieval as an example:
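The original post illustrates this workflow with a diagram. A minimal sketch of the same sequence, with hypothetical method names (request_batch, read, mark_consumed) standing in for the real API, might look like this:

```python
# Hypothetical sketch of the data-retrieval sequence (all names are assumptions,
# not the actual TransferQueue API).
def get_experience(controller, storage_units, task_name, required_columns, batch_size):
    # 1) Ask the task-specific controller (control plane) for metadata of ready
    #    samples: which rows have all required columns produced, and on which
    #    storage unit each row lives.
    metadata = controller.request_batch(task_name, required_columns, batch_size)

    # 2) Fetch the actual tensors from the distributed storage units (data plane).
    samples = [
        storage_units[entry["storage_rank"]].read(entry["row_id"], required_columns)
        for entry in metadata
    ]

    # 3) Report consumption so the controller can update per-sample status.
    controller.mark_consumed(task_name, [entry["row_id"] for entry in metadata])
    return samples
```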
To simplify the usage of TransferQueue, we have encapsulated its capabilities into a PyTorch DataLoader. Please refer to the code example section.
Code Example
Initialization
We first showcase the initialization process of the TransferQueue system. In this example, we hardcode the initialization into the Trainer class for simplicity. In practice, we can encapsulate this process within a dedicated module, as it is decoupled from the Trainer class's core responsibilities.
TransferQueue supports diverse reinforcement learning algorithms without requiring internal code modifications. Specifically, it is initialized using a user-defined data dependency specification, which dynamically allocates empty storage and metadata spaces for training samples.
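Since the original code block is not reproduced here, the following is a minimal sketch of what such an initialization could look like, assuming hypothetical TransferQueueController and TransferQueueStorage classes and an illustrative data dependency specification; none of these names are the confirmed API.

```python
# Hypothetical initialization sketch (class names and arguments are assumptions,
# not the exact TransferQueue API).

# User-defined data dependency specification: for each RL task, the columns it
# consumes and the columns it produces.
DATA_DEPENDENCY = {
    "actor_rollout":       {"inputs": ["prompt"],
                            "outputs": ["response", "old_log_prob"]},
    "reference_inference": {"inputs": ["prompt", "response"],
                            "outputs": ["ref_log_prob"]},
    "actor_update":        {"inputs": ["prompt", "response",
                                       "old_log_prob", "ref_log_prob"],
                            "outputs": []},
}


class TransferQueueController:
    """Stub of a task-specific controller tracking per-sample metadata (control plane)."""
    def __init__(self, task_name, spec, global_batch_size):
        self.task_name, self.spec, self.global_batch_size = task_name, spec, global_batch_size


class TransferQueueStorage:
    """Stub of a distributed storage unit holding the actual tensors (data plane)."""
    def __init__(self, unit_id):
        self.unit_id = unit_id


class Trainer:
    def init_transfer_queue(self, global_batch_size, num_storage_units=2):
        # One task-specific controller per RL task, created from the dependency spec.
        self.controllers = {
            task: TransferQueueController(task, spec, global_batch_size)
            for task, spec in DATA_DEPENDENCY.items()
        }
        # Distributed storage units that hold the produced data for each sample.
        self.storages = [TransferQueueStorage(i) for i in range(num_storage_units)]
```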
Core Training Logic
With the help of TransferQueue, we can decouple explicit data dependencies across RL tasks. Now the core training logic can be simplified as follows:
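As a hedged sketch (the worker-group names follow common verl conventions, but the zero-argument, TransferQueue-backed signatures are assumptions rather than the actual interface), the loop no longer passes batches between tasks explicitly; each task pulls its inputs from TransferQueue and writes its outputs back:

```python
# Hedged sketch of the simplified core training loop (names and signatures are
# assumptions; no batch object is passed between tasks).
def fit(self):
    for step in range(self.total_training_steps):
        # Each worker group pulls the columns it needs from TransferQueue and
        # writes its outputs back.
        self.actor_rollout_wg.generate_sequences()  # reads prompt            -> writes response, old_log_prob
        self.ref_policy_wg.compute_ref_log_prob()   # reads prompt, response  -> writes ref_log_prob
        self.critic_wg.compute_values()             # reads prompt, response  -> writes values
        self.critic_wg.update_critic()              # reads the columns above
        self.actor_rollout_wg.update_actor()        # reads the columns above
```

Because the data dependencies are resolved inside TransferQueue, these calls could in principle also be dispatched asynchronously to overlap the pipeline.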
Interaction Interface
To simplify the usage of TransferQueue, we provide a DataLoader interface that abstracts the underlying complexity. This design allows users to treat TransferQueue as a standard iterator, enabling seamless integration into existing training and inference engines with familiar syntax.
Note that TransferQueue supports fine-grained, micro-batch-level data retrieval. Combined with its automatic task workflow management capability, this enables fully streaming data management, as shown below.
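A hypothetical usage sketch of this iterator-style consumption on the worker side (the dataset, policy, and optimizer objects are placeholders, not the actual API):

```python
# Hypothetical usage sketch of the DataLoader-style interface (all names are
# placeholders; the real interface may differ).
from torch.utils.data import DataLoader


def actor_update_loop(streaming_dataset, policy, optimizer):
    # batch_size=None disables PyTorch's automatic batching: micro-batching is
    # handled inside the dataset, which yields a micro-batch as soon as all of
    # its required inputs have been produced upstream.
    for micro_batch in DataLoader(streaming_dataset, batch_size=None):
        loss = policy.compute_loss(micro_batch)  # user-defined training step
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```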
A Glimpse Inside TransferQueue
Now we showcase some of the inner logic of the TransferQueue system. As discussed above, put_experience and get_experience encapsulate the interaction process with TransferQueue. In TransferQueueController, we provide a load-balancing interface that supports various strategies. For example, we can integrate the DP load balancing in veRL (i.e., _balance_batch in ray_trainer.py) into this interface, as well as more advanced strategies.
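A minimal sketch of what such a pluggable interface could look like, assuming hypothetical function names and a per-sample seq_len field; the seqlen-based strategy only mirrors the spirit of _balance_batch, not its actual code:

```python
# Hypothetical sketch of a pluggable load-balancing hook (function names and the
# per-sample "seq_len" field are assumptions).
def round_robin_balance(ready_samples, dp_world_size):
    """Default strategy: assign ready samples to DP ranks in round-robin order."""
    buckets = [[] for _ in range(dp_world_size)]
    for i, sample in enumerate(ready_samples):
        buckets[i % dp_world_size].append(sample)
    return buckets


def seqlen_balance(ready_samples, dp_world_size):
    """Greedily balance the total sequence length assigned to each DP rank."""
    buckets = [[] for _ in range(dp_world_size)]
    loads = [0] * dp_world_size
    for sample in sorted(ready_samples, key=lambda s: s["seq_len"], reverse=True):
        rank = loads.index(min(loads))  # least-loaded rank so far
        buckets[rank].append(sample)
        loads[rank] += sample["seq_len"]
    return buckets


def dispatch(ready_samples, dp_world_size, balance_fn=round_robin_balance):
    """Controller-side hook: balance_fn can be swapped for any custom strategy."""
    return balance_fn(ready_samples, dp_world_size)
```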
We present the core functionalities of the StreamingDataset abstraction below. The current implementation serves as a reference design. We are actively refining the abstraction to enhance usability and flexibility in future iterations.
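As a hedged reference sketch of that abstraction (the tq_client methods exhausted and get_experience are assumptions, not the confirmed interface):

```python
# Hedged sketch of the StreamingDataset abstraction (a reference design only;
# the tq_client methods used here are assumptions).
from torch.utils.data import IterableDataset


class StreamingDataset(IterableDataset):
    """Wraps a TransferQueue client for one RL task and streams ready micro-batches."""

    def __init__(self, tq_client, task_name, required_columns, micro_batch_size):
        self.tq_client = tq_client
        self.task_name = task_name
        self.required_columns = required_columns
        self.micro_batch_size = micro_batch_size

    def __iter__(self):
        # Keep polling until the current global batch has been fully consumed
        # by this task.
        while not self.tq_client.exhausted(self.task_name):
            # Blocks until a micro-batch worth of samples has all required
            # columns produced, then fetches the tensors from the data plane.
            yield self.tq_client.get_experience(
                task_name=self.task_name,
                columns=self.required_columns,
                batch_size=self.micro_batch_size,
            )
```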
In the future, we plan to remove the dependency on Ray to reduce serialization and communication overhead. Currently, we have implemented several adaptations to address these issues, but eventually we aim to propose a more general dataflow management module. The system will introduce the following core components:
Discussion
The next phase of LLM post-training (multi-agent, tool use, etc.) will introduce extra system components and lead to complex data dependencies. This evolution necessitates a centralized, general-purpose data management module to dynamically connect these components.
Through preliminary discussions with the verl team, we have learnt that the future verl framework will evolve toward a modular architecture (PR1977), which has brought such centralized data management systems onto the development agenda. By sharing our design and implementation experiences, we hope to address these emerging challenges together with the community.
CC
@vermouth1992 @ETOgaosion @ccclyu @wconstab @lxg2015 @mori360 @weifengpy @PeterSH6 @yushengsu-thu @Chendong98 @as12138