[data] feat: Provide general decorator for DataProto <-> BatchMeta #21
Conversation
Signed-off-by: 0oshowero0 <[email protected]>
Signed-off-by: 0oshowero0 <[email protected]>
Signed-off-by: 0oshowero0 <[email protected]>
Pull Request Overview
This PR introduces a general decorator for converting between DataProto and BatchMeta objects to enable TransferQueue integration. The decorator wraps functions that operate on DataProto objects so they can work with BatchMeta and the TransferQueue system.
Key changes:
- Implements a `dataproto_batchmeta_conversion` decorator that handles conversion from BatchMeta to DataProto, execution of the wrapped function, and conversion of the result back to BatchMeta
- Provides both synchronous and asynchronous wrappers, with client-based data retrieval or mock data generation for testing
- Includes a comprehensive test suite validating decorator functionality with real DataProto instances and mock TransferQueue components
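For orientation, here is a minimal sketch of the round trip the decorator performs. The helper names mirror those quoted in the review excerpts below; the bodies and signatures are placeholders, not the PR's actual implementation:

```python
import functools


async def _batchmeta_to_dataproto_async(batch_meta, client):
    # Placeholder: fetch the tensors referenced by batch_meta through the
    # TransferQueue client and assemble a DataProto from them.
    ...


async def _update_batchmeta_with_result_async(result_data, batch_meta, client):
    # Placeholder: write result_data back through the client and return the
    # updated BatchMeta.
    ...


def dataproto_batchmeta_conversion(transfer_queue_client=None):
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(batch_meta, *args, **kwargs):
            client = kwargs.pop("transfer_queue_client", transfer_queue_client)
            data = await _batchmeta_to_dataproto_async(batch_meta, client)  # BatchMeta -> DataProto
            result_data = await func(data, *args, **kwargs)                 # run the wrapped function
            return await _update_batchmeta_with_result_async(result_data, batch_meta, client)  # -> BatchMeta
        return wrapper
    return decorator
```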
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| recipe/transfer_queue/dataproto_conversion.py | Core decorator implementation with conversion utilities and client integration |
| recipe/transfer_queue/test_dataproto_decorator.py | Independent test script demonstrating decorator usage with DataProto and mock TransferQueue |
```python
data = await _batchmeta_to_dataproto_async(batch_meta, client)

# Call function with DataProto
result_data = await func(data, *other_args, **other_kwargs)
```
The function is being awaited but may not be async. The wrapper assumes `func` is async; it should check whether it is a coroutine function first, or handle both sync and async functions appropriately.
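One way to address this, as a sketch (the helper name `call_maybe_async` is hypothetical):

```python
import asyncio
import functools


def call_maybe_async(func):
    """Wrap func so it can be awaited whether it is sync or async (hypothetical helper)."""
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        if asyncio.iscoroutinefunction(func):
            return await func(*args, **kwargs)
        # Run a synchronous function in the default executor so it does not block the loop.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, functools.partial(func, *args, **kwargs))
    return wrapper
```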
```python
    # We're in a running loop, this shouldn't happen for sync wrapper
    raise RuntimeError("Sync wrapper called from within async context")
except RuntimeError:
    # No running loop, we can use asyncio.run
    data_dict = asyncio.run(client.async_get_data(batch_meta))
```
This logic is flawed: `asyncio.get_running_loop()` raises `RuntimeError` when no loop is running, but the code catches ALL `RuntimeError` exceptions. This could mask the intentionally raised error on line 127. Use the `asyncio.get_running_loop()` return value instead of exception handling.
Suggested change:

```python
except RuntimeError:
    # No running loop, we can use asyncio.run
    data_dict = asyncio.run(client.async_get_data(batch_meta))
else:
    # We're in a running loop, this shouldn't happen for sync wrapper
    raise RuntimeError("Sync wrapper called from within async context")
```
```python
try:
    return TensorDict(**tensor_dict, batch_size=len(data))
except Exception as e:
    logger.warning(f"TensorDict creation failed: {e}, trying fallback")
    # Fallback: create with batch_size parameter
    td = TensorDict({}, batch_size=len(data))
    for key, value in tensor_dict.items():
        td.set(key, value)
    return td
```
Catching broad `Exception` makes debugging difficult. Consider catching specific TensorDict-related exceptions, or at minimum log the specific exception type and the `tensor_dict` contents for better debugging.
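A sketch of the narrower handling for the excerpt above; the specific exception types chosen here (`TypeError`, `ValueError`, `RuntimeError`) are an assumption about how `TensorDict` construction fails:

```python
try:
    return TensorDict(**tensor_dict, batch_size=len(data))
except (TypeError, ValueError, RuntimeError) as e:
    # Log the exception type and the offending keys, not just the message.
    logger.warning(
        "TensorDict creation failed (%s: %s) for keys %s, trying fallback",
        type(e).__name__, e, list(tensor_dict.keys()),
    )
    td = TensorDict({}, batch_size=len(data))
    for key, value in tensor_dict.items():
        td.set(key, value)
    return td
```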
```python
def dataproto_batchmeta_conversion_v2(func: Optional[Callable] = None, *, transfer_queue_client: Optional[AsyncTransferQueueClient] = None):
    """
    Alternative decorator syntax that supports both @decorator and @decorator() usage.
    """
    def decorator(f: Callable) -> Callable:
        return dataproto_batchmeta_conversion(transfer_queue_client)(f)

    if func is not None:
        return decorator(func)
    return decorator
```
The `_v2` function appears to be unused and provides the same functionality as the main decorator. Consider removing this duplicate implementation to reduce code complexity.
Suggested change: delete `dataproto_batchmeta_conversion_v2` entirely.
Signed-off-by: 0oshowero0 <[email protected]>
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
```python
# Call function with DataProto
result_data = await func(data, *other_args, **other_kwargs)
```
The function `func` is being awaited but there's no guarantee it's a coroutine; this will fail if the wrapped function is synchronous. The check on line 101 should determine which wrapper to use, but the async wrapper shouldn't call non-async functions with `await` (see the `call_maybe_async` sketch earlier in this thread).
```python
# We're in a running loop, use run_coroutine_threadsafe
future = asyncio.run_coroutine_threadsafe(client.async_get_data(batch_meta), loop)
data_dict = future.result(timeout=10)  # 10 second timeout
```
Using `asyncio.run_coroutine_threadsafe` with the current running loop will likely cause deadlock: `future.result()` blocks the very thread the loop needs in order to execute the coroutine. When there's already a running event loop, use `await` instead of trying to run the coroutine on the same loop from a blocked thread context.
Suggested change:

```python
# We're in a running event loop in this thread; cannot safely run coroutine synchronously.
raise RuntimeError(
    "Cannot call _batchmeta_to_dataproto_sync when an event loop is running in this thread. "
    "Use the async version (_batchmeta_to_dataproto_async) instead."
)
```
```python
    loop = asyncio.get_running_loop()
except RuntimeError:
    # No running loop, we can use asyncio.run
    asyncio.run(client.async_put(data=output_tensor_dict, metadata=batch_meta))
else:
    # We're in a running loop, use run_coroutine_threadsafe
    future = asyncio.run_coroutine_threadsafe(client.async_put(data=output_tensor_dict, metadata=batch_meta), loop)
    future.result(timeout=10)  # 10 second timeout
```
Same issue as with `async_get_data`: using `asyncio.run_coroutine_threadsafe` with the current running loop will likely cause deadlock. This pattern is problematic when already inside an event loop.
Suggested change:

```python
    asyncio.get_running_loop()
except RuntimeError:
    # No running loop, we can use asyncio.run
    asyncio.run(client.async_put(data=output_tensor_dict, metadata=batch_meta))
else:
    # We're in a running event loop in this thread; cannot safely run async code synchronously.
    raise RuntimeError(
        "Cannot call _update_batchmeta_with_result_sync while an event loop is running in this thread. "
        "Use _update_batchmeta_with_result_async instead."
    )
```
```python
for field_name in batch_meta.field_names:
    if field_name == "input_ids":
        data_dict[field_name] = torch.randint(0, 1000, (batch_size, 10))
    elif field_name == "attention_mask":
        data_dict[field_name] = torch.ones(batch_size, 10)
    elif field_name == "responses":
        data_dict[field_name] = torch.randint(0, 1000, (batch_size, 5))
    else:
        # Generic mock data
        data_dict[field_name] = torch.ones(batch_size, 5)
```
The mock data generation logic is duplicated between the sync and async versions (lines 141-150 and 170-179). It should be extracted into a separate helper function to avoid code duplication, e.g. as sketched below.
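A sketch of the extracted helper (the name `_generate_mock_data` is hypothetical; the field shapes are copied from the excerpt above):

```python
import torch


def _generate_mock_data(batch_meta, batch_size: int) -> dict:
    """Build per-field mock tensors for testing; shared by the sync and async paths."""
    data_dict = {}
    for field_name in batch_meta.field_names:
        if field_name == "input_ids":
            data_dict[field_name] = torch.randint(0, 1000, (batch_size, 10))
        elif field_name == "attention_mask":
            data_dict[field_name] = torch.ones(batch_size, 10)
        elif field_name == "responses":
            data_dict[field_name] = torch.randint(0, 1000, (batch_size, 5))
        else:
            # Generic mock data
            data_dict[field_name] = torch.ones(batch_size, 5)
    return data_dict
```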
```python
# Test with client in a separate thread to avoid event loop issues
print("\n2. Testing compute_response_mask decorator with client...")
try:
    # Run in a separate thread to avoid event loop conflicts
    import concurrent.futures
    with concurrent.futures.ThreadPoolExecutor() as executor:
        future = executor.submit(compute_response_mask_decorated, batch_meta, transfer_queue_client=mock_client)
        result_batch_meta = future.result(timeout=10)
```
Using ThreadPoolExecutor to avoid event loop issues indicates a design problem with the decorator. The decorator should handle async and sync contexts properly without requiring thread workarounds in tests.
Suggested change:

```python
# Test with client, handle async/sync context properly
print("\n2. Testing compute_response_mask decorator with client...")
try:
    result = compute_response_mask_decorated(batch_meta, transfer_queue_client=mock_client)
    if asyncio.iscoroutine(result):
        result_batch_meta = await result
    else:
        result_batch_meta = result
```
```python
else:
    # We're in a running loop, use run_coroutine_threadsafe
    future = asyncio.run_coroutine_threadsafe(client.async_get_data(batch_meta), loop)
    data_dict = future.result(timeout=10)  # 10 second timeout
```
The 10-second timeout is a magic number that appears multiple times (lines 134, 204). It should be defined as a constant at module level for better maintainability.
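For example (later commits in this PR do appear to adopt such a constant, judging by the `DEFAULT_ASYNC_TIMEOUT` reference in the next review round):

```python
# Module-level constant instead of repeating the literal 10 at every call site.
DEFAULT_ASYNC_TIMEOUT = 10  # seconds

# usage: future.result(timeout=DEFAULT_ASYNC_TIMEOUT)
```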
Signed-off-by: 0oshowero0 <[email protected]>
Pull Request Overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
```python
if "no running event loop" in str(e):
    # No running loop, we can use asyncio.run
    asyncio.run(
        client.async_put(data=output_tensor_dict, metadata=batch_meta), timeout=DEFAULT_ASYNC_TIMEOUT
```
The `asyncio.run()` function doesn't accept a `timeout` parameter; `timeout` belongs to `asyncio.wait_for()`. The coroutine should be wrapped in `asyncio.wait_for()`, similar to the async version.
Suggested change (replacing the inner call, so the result is `asyncio.run(asyncio.wait_for(...))`):

```python
asyncio.wait_for(
    client.async_put(data=output_tensor_dict, metadata=batch_meta),
    timeout=DEFAULT_ASYNC_TIMEOUT
)
```
```python
try:
    return TensorDict(**tensor_dict, batch_size=len(data))
except Exception as e:
    logger.warning(f"TensorDict creation failed: {e}, trying fallback")
```
The variable `logger` is not defined in this scope. A logger should be configured at module level, or the function should use `print()` for error output.
Suggested change:

```python
print(f"TensorDict creation failed: {e}, trying fallback")
```
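Alternatively, keep the `logger.warning` call and define a module-level logger (standard-library sketch):

```python
import logging

# Define once at module level so `logger` exists wherever the fallback runs.
logger = logging.getLogger(__name__)
```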
Signed-off-by: 0oshowero0 <[email protected]>
Signed-off-by: 0oshowero0 <[email protected]>
What does this PR do?
As the title
Checklist Before Starting
- PR title is formatted as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - For a breaking change, add `[BREAKING]` to the beginning of the title, e.g. `[BREAKING][fsdp, megatron] feat: dynamic batching`
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
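A hedged usage sketch, based on the decorated `compute_response_mask` exercised in the test script; the import path and exact call signature are assumptions, not the confirmed API:

```python
# Hypothetical usage; names follow the files in this PR, but the import path
# and signature are assumptions.
from recipe.transfer_queue.dataproto_conversion import dataproto_batchmeta_conversion


@dataproto_batchmeta_conversion()
def compute_response_mask(data):
    # Operates on a DataProto exactly as before the decorator was applied.
    ...


# `batch_meta` and `client` come from the TransferQueue setup. The decorated
# function now accepts a BatchMeta and returns an updated BatchMeta, moving
# the underlying tensors through the TransferQueue client.
result_batch_meta = compute_response_mask(batch_meta, transfer_queue_client=client)
```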
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- Request CI in the `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)