[RFC] A centralized data management module for fine-grained dataflow management across RL tasks #2662

0oshowero0 · 2025-07-21T12:14:10Z

0oshowero0
Jul 21, 2025

A Chinese version (not an identical translation) can be found at Zhihu ：）

Motivation

We observe that the recently proposed RL frameworks for LLM post-training mainly adopt a task-separated paradigm. Distinct RL tasks such as actor rollout, reference inference, and actor update are distributed across separate hardware resources.

Framework	Publish Time	Architecture
DeepSpeed-Chat	2023	Task-Collocated
NeMo-Aligner	2024	Task-Collocated
RLHFuse	2024	Task-Collocated
verl	2024	Task-Collocated
OpenRLHF	2024	Task-Separated
K1.5	2025	Task-Collocated
SeedThinking-1.5	2025	Task-Separated
StreamRL	2025	Task-Separated
AReal	2025	Task-Separated
slime	2025	Task-Separated/Collocated
siiRL	2025	Task-Separated

The core story lies in scaling efficiency. The emergence of reasoning models (e.g., OpenAI o1 and DeepSeek R1) has unveiled post-training scaling laws, resulting in significantly higher computational demands for RL training.

However, in task-collocated frameworks, the binding relationship between actor rollout and actor update significantly limits this scaling law, as their workloads exhibit distinct characteristics. Consequently, achieving equivalent throughput requires allocating varying numbers of computational resources to these tasks, as shown in Fig.2 of the StreamRL study.

The task-separated framework eliminates this constraint by allocating dedicated resources for each RL task, thus enabling efficient large-scale post-training while incurring significant challenges in data management. To reduce the pipeline bubbles in task-separated frameworks, we need to execute RL tasks in parallel. Current implementations of DataProto & Dispatch focus on block-level (global batch) data management across worker processes within a single RL task, which brings difficulties in fine-grained parallel execution at the micro-batch level across different RL tasks.

To address this challenge, we propose a centralized streaming data management module named TransferQueue to route fine-grained data dependencies across RL tasks dynamically.

This data management module decouples the data dependency across RL tasks, effectively reducing the complexity of framework design. Moreover, it enables automatic pipeline overlapping among RL tasks, and provides a general entry point for batch load-balance, thereby resulting in a higher throughput.

The detailed motivation and architecture design can be found in this paper: AsyncFlow: An Asynchronous Streaming RL Framework for Efficient LLM Post-Training.

Proposed Design

Architecture Overview

As illustrated below, TransferQueue acts as a streaming data scheduler bridging the training and inference tasks, managing the entire dataflow in the RL post-training process. The control plane maintains the fine-grained metadata of each training sample, while the data plane stores the actual data in a distributed manner.

TransferQueue employs a 2D data structure as follows:

Each row corresponds to a training sample, assigned a unique index within the corresponding global batch.
Each column represents the input/output data components for RL tasks. For example, we have prompt column, response column, ref log prob column, etc.

This data structure design is motivated by the computational characteristics of the RL training process, where each training sample is generated in a relayed manner across RL task pipelines. It provides an accurate addressing capability, which allows fine-grained, concurrent data read/write operations in a streaming manner.

In the control plane, we track the production and consumption status of each training sample as metadata. When all the required computational tasks are completed (i.e., all column flags are set to ✅ for a given row), we know that this data sample can be consumed by downstream tasks. Note that each RL task requires different inputs, so we deploy task-specific controllers for each RL task. For example, reference inference only needs the prompt, while actor update requires prompt, response, old log prob, etc.)

To illustrate the interaction workflow, we use data retrieval as an example:

A DP group sends a read request to the corresponding TransferQueue controller.
TransferQueue controller scans the production and consumption metadata for each sample (row), and dynamically assembles a micro-batch metadata according to the load-balancing policy. This mechanism enables sample-level data scheduling.
The DP group retrieves the actual data from distributed storage units using the metadata provided by the controller.

To simplify the usage of TransferQueue, we have encapsulated its capabilities into a PyTorch DataLoader. Please refer to the code example section.

Code Example

Initialization

We first showcase the init process of TransferQueue system. In this example, we hardcode the init process into the Trainer class for simplicity. In practice, we can encapsulate this process within a dedicated module, as it is decoupled from the Trainer class's core responsibilities.

TransferQueue supports diverse reinforcement learning algorithms without requiring internal code modifications. Specifically, it is initialized using a user-defined data dependency specification, which dynamically allocates empty storage and metadata spaces for training samples.

class RayGRPOTrainer():
    def __init__():
        ...
        # Define the data dependency for each RL task
        self.data_dependency = {
            'actor_rollout': ['prompts', 'prompt_length'],
            'actor_log_prob': ['input_ids', 'responses', 'response_length', 'prompt_length'],
            'ref_log_prob': ['input_ids', 'responses', 'response_length', 'prompt_length'],
            'actor_train': ['responses', 'advantages', 'old_log_prob', 'ref_log_prob',
                            'input_ids', 'response_length', 'prompt_length'],
            'compute_advantage': ["responses", "token_level_rewards", "response_length"],
            'rule_reward': ['prompts', 'responses', 'response_length'],
            'reward_scores': ['input_ids', 'prompt_length', "responses", "response_length"],
            'grpo_metrics': ["responses", "token_level_rewards", "response_length",
                            "rm_scores", "advantages", "returns", "prompt_length"],
        }

        # initialize the TransferQueue sustem
        self.transfer_queue_init()

    def transfer_queue_init(self):
        ...
        # Init TransferQueue Storage
        transfer_queue_storage_nodes = []
        for storage_node_rank in range(num_storage_nodes):
            # This will trigger a remote call based on Ray. storage_node is a handler.
            storage_node = TransferQueueStorage(GBS,
                                                storage_node_rank,
                                                num_storage_nodes,
                                                self.consumers_columns)
            transfer_queue_storage_nodes.append(storage_node)
        self.transfer_queue_storage_nodes = transfer_queue_storage_nodes
        
        # Init TransferQueue Controller.
        # This is also a remote call, with handlers return.
        self.transfer_queue_controllers = self.init_transfer_queue_controller(GBS, self.consumers_columns)
    
        # Register TransferQueue Controller for each storage node
        for node in self.transfer_queue_storage_nodes:
            ray.get(node.handler.register_tq_controllers.remote(self.td_controllers))

Core Training Logic

With the help of TransferQueue, we can decouple explicit data dependencies across RL tasks. Now the core training logic can be simplified as follows:

Start all the RL tasks asynchronously. These tasks will request input data from TransferQueue.
Feed the first global batch of prompts into TransferQueue.
RL tasks successfully retrieve the required data from TransferQueue and initiate their computational workloads. The pipeline is automatically formed without any human effort.
The training logic detects the completion of the current iteration, records performance metrics, evicts the current global batch, and puts new prompts into TransferQueue.

class RayGRPOTrainer():
    def fit(self, data_iters):
        iteration = 0
        # 1. Start all the RL tasks. Each task tries to pull data from TransferQueue by themselves.
        
        # generate sequences
        self.rollout_worker.start(iteration, self.train_iters)
        # compute reference log_prob
        self.ref_worker.start(iteration, self.train_iters)
        # compute rm scores.
        rule_reward = []
        for reward_worker in self.reward_list:
            if isinstance(reward_worker, RayActorGroup):
                reward_worker.start(iteration, self.train_iters)
            else:
                rule_reward.append(reward_worker.start.remote(iteration, self.train_iters))
        # compute advantage
        for advantage in self.advantage_list:
            advantage.start.remote(iteration, self.train_iters)
        # compute old log_prob
        if self.actor_fwd_worker:
            self.actor_fwd_worker.start(iteration, self.train_iters)
        # update actor
        self.actor_worker.start(iteration, self.train_iters)
    
        # 2. Put the first global batch of prompts into TransferQueue
        start_iter_time = time.time()
        for _ in range(self.staleness_threshold):
            batch = next(data_iters)
            put_prompts(batch, self.n_samples_per_prompt, total_data_rows)
    
        # 3. Iteration control
        while True:
            train_iteration = self.iteration_record.get() # iteration_record can be a Queue
    
            # Do metric update
            ...
    
            # Clear the corresponding TransferQueue
            self.clear_tq_controllers(train_iteration, indexes)
    
            # Put new prompts into TransferQueue
            if data_loader_index < self.train_iters:
                batch = next(data_iters)
                put_prompts(self.metrics_tq, batch, self.n_samples_per_prompt, self.dataset_additional_keys, indexes[0])
                data_loader_index += 1
            iteration += 1
    
            # Stop training
            if iteration >= self.train_iters:
                logger.info(f"The threshold of train iteration:{self.train_iters} is reached, stop putting prompt to TQ")
                ray.shutdown()
                break

Interaction Interface

To simplify the usage of TransferQueue, we provide a DataLoader interface that abstracts the underlying complexity. This design allows users to treat TransferQueue as a standard iterator, enabling seamless integration into existing training and inference engines with familiar syntax.

class RolloutWorker(BaseWorker):
    def generate_sequences(self):
        experience_consumer = 'actor_rollout'
        experience_columns = ['prompts', 'prompt_length']
        experience_count = self.rl_config.rollout_dispatch_size  # can be set to MBS
    
        data_loader = self.create_stream_data_loader(
            experience_consumer=experience_consumer,
            experience_columns=experience_columns,
            experience_count=experience_count,
            order_preserving_flag=self.rl_config.guarantee_order
        )
        data_iter = iter(data_loader)
    
        for batch_data, index in data_iter:
            indexes = list(range(0, experience_count, self.rl_config.n_samples_per_prompt))
            prompts_data = batch_data['prompts'][indexes]
    
            # Do Inference
            responses = self.rollout.generate_sequences(prompts_data)[0]
    
            # Prepare Output
            input_ids_list = []
            for prompt, response in zip(prompts_data, responses):
                input_ids_list.append(torch.cat((prompt, response), dim=0))
    
            outputs = {
                'responses': responses,
                'input_ids': input_ids_list,
            }
    
            # write results back to TransferQueue
            self.collect_transfer_queue_data(outputs, index)

class BaseWorker:
    def collect_transfer_queue_data(self, output, index):
        if is_pipeline_last_stage(self.parallel_state) and get_tensor_model_parallel_rank(self.parallel_state) == 0:
            output = {key: value.cpu() if not isinstance(value, List) else value for key, value in output.items()}

            # Write results back to TransferQueue. This is a utility function that encapsulates
            # the logic of putting data into the TransferQueue. In the future, we may propose a dedicated
            # client interface for this.
            put_experience(self.tq_controller, data_dict=output, indexes=index)

Note that TransferQueue supports fine-grained micro-batch-level data retrieval. Combining with its automatic task workflow management capability, we can achieve a fully streaming data management as follows.

A Glimpse Inside TransferQueue

Now we showcase some inner logic of TransferQueue system. As discussed above, the put_experience and get_experience have encapsulated the interaction process of TransferQueue.

def put_experience(controller_handler: ray.actor.ActorHandle,
                      data_dict: Dict[str, Union[Tensor, List[Tensor]]],
                      indexes: List[int]) -> None:

    # split the batch block into samples
    experience_columns, experience = trans_input_into_experience(data_dict)

    # get metadata from TransferQueue controller
    metadata_list = ray.get(controller_handler.get_metadata_on_writing.remote(indexes))
    if metadata_list is None:
        logger.warning("put experience return None.")
        return

    # write actural data into distributed storage node
    experience_data = transfer_to_tensor_dict(experience_columns, experience)

    # for simplicity, we use ray to conduct data transfer. In the future, we will
    # integrate with a more efficient data transfer mechanism and supports more 
    # data storage backends.
    ray.get([metadata.handler.put_data.remote(
        experience_data, metadata.experience_offset, metadata.indexes)
        for metadata in metadata_list])

In TransferQueueController, we provide a load-balance interface that supports various strategies. For example, we can integrate the DP load-balance in veRL (i.e., the _balance_batch in ray_trainer.py), as well as more advanced strategies in this interface.

class TransferQueueController:
    def get_metadata_on_reading(self, consumer: str, experience_count: int, policy='DP_balance'):
        meta_data = self.sample_batch_metadata(consumer, experience_count, policy)
        return meta_data

    def sample_batch_metadata(self, consumer, experience_count, policy):
        # 1. scan all the samples that are ready to be consumed
        ready_index = self.get_ready_sample_index()
        # 2. use load balance sampler to sample a batch of metadata
        batch_meta = self.load_balance_sampler(ready_index, consumer, experience_count, policy)
        return batch_meta

We present the core functionalities of the StreamingDataset abstraction below. The current implementation serves as a reference design. We are actively refining the abstraction to enhance usability and flexibility in future iterations.

class StreamingDataset(IterableDataset):
    def _dispatch_transfer_queue_data_zmq(self, sorted_index):
        server_flag = self.tp_rank == 0

        batch_data = {}
        if server_flag:
            batch_data, index = get_experience(
                self.tq,
                self.experience_consumer_stage,
                self.experience_count,
                order_preserving_flag=self.order_preserving_flag,
                indexes=sorted_index,
                get_n_samples=self.get_n_samples,
                experience_range=self.experience_range)
            if not index:
                index = [-1] * self.experience_count

            send_index_dict = {}
            send_index_dict["index"] = torch.tensor(index)
            index_dict = self.zmq_server.dispatch_transfer_queue_data(send_index_dict)
        else:
            index_dict = self.zmq_client.wait_publisher_message()

        if self.pp_rank == 0:
            for other_pp_handler in self.other_pp_handlers:
                other_pp_handler.prefetch_index.remote(index_dict["index"].tolist())
        else:
            time_start = time.time()
            while self.prefetch_index_queue.empty():
                time.sleep(0.5)
                if time.time() - time_start > self.timeout:
                    raise(ValueError('Prefetch data out of time'))
            index = self.prefetch_index_queue.get()
            if index[0] == -1:
                return None, None
            batch_data, index = get_experience(
                self.tq,
                self.experience_consumer_stage,
                self.experience_count,
                experience_range=self.experience_range,
                indexes=index,
            )
            batch_data = pad_experience(batch_data, self.pad_id, self.pad_to_multiple_of)
            return batch_data, index

        if index_dict["index"].tolist()[0] == -1:
            return None, None

        if server_flag:
            batch_data, batch_data_length = pack_experience_columns(batch_data, self.experience_count)
            data_dict = self.zmq_server.dispatch_transfer_queue_data(batch_data)
            data_length_dict = self.zmq_server.dispatch_transfer_queue_data(batch_data_length)
        else:
            data_dict = self.zmq_client.wait_publisher_message()
            data_length_dict = self.zmq_client.wait_publisher_message()

        if not data_dict:
            return {}, []

        padded_batch_data = unpack_pad_experience(data_dict, data_length_dict,
                                                  self.pad_id, self.pad_to_multiple_of)
        return padded_batch_data, index_dict["index"].tolist()

    def _dispatch_transfer_queue_data_hccl(self, sorted_index):
        pass

    def _dispatch_transfer_queue_data(self):
        sorted_index = self.sorted_indexes.pop(0) if self.order_preserving_flag and self.sorted_indexes else None
        if self.zmq_server is not None or self.zmq_client is not None:
            return self._dispatch_transfer_queue_data_zmq(sorted_index)
        else:
            return self._dispatch_transfer_queue_data_hccl(sorted_index)

    def all_consumed(self):
        # this is a notification mechanism that determine whether the current global batch is all consumed.
        status = torch.tensor(0, device=torch.cuda.current_device())
        if self.tp_rank == 0 and self.pp_rank == 0:
            status = torch.tensor(int(not ray.get(
                self.tq.all_consumed.remote(self.experience_consumer_stage, experience_range=self.experience_range))),
                                  device=torch.cuda.current_device())

        torch.distributed.all_reduce(status, group=get_model_parallel_group(self.parallel_state, self.use_vllm),
                                     op=torch.distributed.ReduceOp.MAX)
        return status

    def __iter__(self) -> Iterator[Dict]:
        while self.all_consumed() > 0 or not self.data_cache.empty():
            if self.micro_batch_size != 0:
                if self.experience_count % self.micro_batch_size != 0:
                    raise ValueError(f"The experience count {self.experience_count} \
                                       must be divisible by the micro batch size {self.micro_batch_size}.")
                
                if self.data_cache.empty():
                    data_batch, indexes = self._dispatch_transfer_queue_data()
                    if data_batch and indexes:
                        batches_split = []
                        indexes_split = [indexes[i:i + self.micro_batch_size] for i in range(0, len(indexes), self.micro_batch_size)]
                        
                        for key, tensors in data_batch.items():
                            for index, tensor in enumerate(torch.split(tensors, self.micro_batch_size, dim=0)):
                                if index >= len(batches_split):
                                    batches_split.append({})
                                batches_split[index][key] = tensor
                        
                        for i in range(self.experience_count // self.micro_batch_size):
                            self.data_cache.put((batches_split[i], indexes_split[i]))
                else:
                    data_batch, index = self.data_cache.get()
                    yield data_batch, index
            
            else:
                data_batch, index = self._dispatch_transfer_queue_data()
                if data_batch and index:
                    yield data_batch, index
            
    def prefetch_index(self, index: List[int]):
        self.prefetch_index_queue.put(index)

class StreamDataLoader(torch.utils.data.DataLoader):
    def __init__(self, dataset: StreamingDataset):
        self.dataset = dataset
        super().__init__(dataset=self.dataset, collate_fn=_custom_collate)

    def prefetch_index(self, index: List[int]):
        self.dataset.prefetch_index(index)

    def sync_in_pp_group(self):
        self.dataset.sync_in_pp_group()

In the future, we plan to remove the dependency of Ray to reduce serialization and communication overhead. Currently, we have implemented several adaptations to address these issues, but eventually we aim to propose a more general dataflow management module. The system will introduce the following core components:

TransferQueueClient: Deployed on each Worker instance, manages the communication with TransferQueue system using simple put/get semantics. The client should support multiple communication libraries (e.g., ZMQ, TransferEngine) to suit diverse data transfer scenarios.
TransferQueueController: A centralized dataflow scheduler that tracks the production and consumption status of training samples. Maintain the current implementation.
TransferQueueStorage: A unified abstraction layer for distributed storage backends (e.g., Redis, MoonCake). It translates generic put/get operations into backend-specific implementations, ensuring storage-agnostic flexibility and scalability.

Discussion

The next phase of LLM post-training (multi-agent, tool use, etc.) will introduce extra system components and lead to complex data dependencies. This evolution necessitates a centralized, general-purpose data management module to dynamically connect these components.

Through preliminary discussions with the verl team, we have learnt that the future verl framework will evolve toward a modular architecture (PR1977), which has brought such centralized data management systems onto the development agenda. By sharing our design and implementation experiences, we hope to address these emerging challenges together with the community.

CC

@vermouth1992 @ETOgaosion @ccclyu @wconstab @lxg2015 @mori360 @weifengpy @PeterSH6 @yushengsu-thu @Chendong98 @as12138

0oshowero0 · 2025-07-21T12:36:57Z

0oshowero0
Jul 21, 2025
Author

A Chinese version (not an identical translation) can be found at Zhihu ：）

0 replies

eric-haibin-lin · 2025-07-26T17:17:29Z

eric-haibin-lin
Jul 26, 2025
Maintainer

what do you propose as the next step?

2 replies

0oshowero0 Jul 29, 2025
Author

Thank you for your attention! We have summarized an API list of the aforementioned data system, which aims to to enable pluggable, modular implementations that are readily deployable. Besides, we also propose the corresponding adaptations to verl in both short-term (task-collocated architecture) and long-term (fully task-separated with simplified single controller) manners.

Now we are preparing a recipe PR (in close discussion with Chi Zhang) built around short-term adaptations to showcase the data system's usage. Building on this, we can further discuss with the community how to smoothly evolve to a modular architecture to accommodate the new post-training workload for multi-agent RL.

0oshowero0 Jul 30, 2025
Author

I have modified the links so you can view the content without logging in :)

jamindy · 2025-07-26T17:18:00Z

jamindy
Jul 26, 2025

您好，已收到您的邮件，我会尽快给您回复。

0 replies

0oshowero0 · 2025-08-12T08:39:35Z

0oshowero0
Aug 12, 2025
Author

We have updated the API and adaptation as follows. Now we are developing a simple demo to showcase the usage of the data system :)

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[RFC] A centralized data management module for fine-grained dataflow management across RL tasks #2662

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 4 comments 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

[RFC] A centralized data management module for fine-grained dataflow management across RL tasks #2662

Uh oh!

Uh oh!

0oshowero0 Jul 21, 2025

Motivation

Proposed Design

Architecture Overview

Code Example

Initialization

Core Training Logic

Interaction Interface

A Glimpse Inside TransferQueue

Discussion

CC

Replies: 4 comments · 2 replies

Uh oh!

0oshowero0 Jul 21, 2025 Author

Uh oh!

eric-haibin-lin Jul 26, 2025 Maintainer

Uh oh!

Uh oh!

0oshowero0 Jul 29, 2025 Author

Uh oh!

0oshowero0 Jul 30, 2025 Author

Uh oh!

jamindy Jul 26, 2025

Uh oh!

0oshowero0 Aug 12, 2025 Author

0oshowero0
Jul 21, 2025

Replies: 4 comments 2 replies

0oshowero0
Jul 21, 2025
Author

eric-haibin-lin
Jul 26, 2025
Maintainer

0oshowero0 Jul 29, 2025
Author

0oshowero0 Jul 30, 2025
Author

jamindy
Jul 26, 2025

0oshowero0
Aug 12, 2025
Author