Conversation

0oshowero0 (Collaborator) commented Sep 29, 2025

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

As the title

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment and show results such as training curve plots or evaluation results.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

0oshowero0 requested a review from Copilot September 29, 2025 08:47
Signed-off-by: 0oshowero0 <[email protected]>

Copilot AI left a comment

Pull Request Overview

This PR introduces a general decorator for converting between DataProto and BatchMeta objects to enable TransferQueue integration. The decorator wraps functions that operate on DataProto objects so they can work with BatchMeta and the TransferQueue system.

Key changes:

  • Implements dataproto_batchmeta_conversion decorator that handles conversion from BatchMeta to DataProto, function execution, and result conversion back to BatchMeta
  • Provides both synchronous and asynchronous wrappers with client-based data retrieval or mock data generation for testing
  • Includes comprehensive test suite validating decorator functionality with real DataProto instances and mock TransferQueue components
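
As a quick illustration of the intended call pattern, here is a minimal usage sketch. Only the decorator name, the factory form seen later in this PR, and the transfer_queue_client keyword (used in the test script) come from the PR itself; the wrapped function, field names, and import paths are assumptions:

# Hypothetical usage sketch; see caveats in the paragraph above.
from verl import DataProto  # assumed import path
from recipe.transfer_queue.dataproto_conversion import dataproto_batchmeta_conversion

# Factory form as in this PR's code: dataproto_batchmeta_conversion(client)(func).
@dataproto_batchmeta_conversion(None)
def add_response_mask(data: DataProto) -> DataProto:
    # Ordinary DataProto-in, DataProto-out logic, unaware of BatchMeta.
    data.batch["response_mask"] = data.batch["attention_mask"].clone()
    return data

# Callers pass a BatchMeta instead of a DataProto; the decorator fetches the
# tensors through the TransferQueue client, runs the function, writes the
# result back, and returns an updated BatchMeta.
# result_meta = add_response_mask(batch_meta, transfer_queue_client=client)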

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
  • recipe/transfer_queue/dataproto_conversion.py: Core decorator implementation with conversion utilities and client integration
  • recipe/transfer_queue/test_dataproto_decorator.py: Independent test script demonstrating decorator usage with DataProto and mock TransferQueue

        data = await _batchmeta_to_dataproto_async(batch_meta, client)

        # Call function with DataProto
        result_data = await func(data, *other_args, **other_kwargs)

Copilot AI Sep 29, 2025

The function is being awaited but may not be async. The wrapper assumes func is async but should check if it's a coroutine function first, or handle both sync and async functions appropriately.
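
A minimal sketch of the suggested dispatch, reusing func and the call-site names from the snippet above:

import asyncio

# Sketch: only await func when it is actually a coroutine function,
# otherwise call it synchronously.
if asyncio.iscoroutinefunction(func):
    result_data = await func(data, *other_args, **other_kwargs)
else:
    result_data = func(data, *other_args, **other_kwargs)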

Comment on lines 127 to 131
        # We're in a running loop, this shouldn't happen for sync wrapper
        raise RuntimeError("Sync wrapper called from within async context")
    except RuntimeError:
        # No running loop, we can use asyncio.run
        data_dict = asyncio.run(client.async_get_data(batch_meta))

Copilot AI Sep 29, 2025

This logic is flawed: asyncio.get_running_loop() raises RuntimeError when no loop is running, but the code catches ALL RuntimeError exceptions, so it can mask the error intentionally raised on line 127. Branch with try/except/else so the except clause only handles the no-running-loop case and the intentional raise is not swallowed.

Suggested change
-        # We're in a running loop, this shouldn't happen for sync wrapper
-        raise RuntimeError("Sync wrapper called from within async context")
-    except RuntimeError:
-        # No running loop, we can use asyncio.run
-        data_dict = asyncio.run(client.async_get_data(batch_meta))
+    except RuntimeError:
+        # No running loop, we can use asyncio.run
+        data_dict = asyncio.run(client.async_get_data(batch_meta))
+    else:
+        # We're in a running loop, this shouldn't happen for sync wrapper
+        raise RuntimeError("Sync wrapper called from within async context")

Comment on lines 271 to 279
    try:
        return TensorDict(**tensor_dict, batch_size=len(data))
    except Exception as e:
        logger.warning(f"TensorDict creation failed: {e}, trying fallback")
        # Fallback: create with batch_size parameter
        td = TensorDict({}, batch_size=len(data))
        for key, value in tensor_dict.items():
            td.set(key, value)
        return td

Copilot AI Sep 29, 2025

Catching broad Exception makes debugging difficult. Consider catching specific TensorDict-related exceptions or at minimum log the specific exception type and tensor_dict contents for better debugging.
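
For instance, a sketch of a narrower handler; the exception types listed are an assumption about what TensorDict construction can raise, not a verified set:

try:
    return TensorDict(**tensor_dict, batch_size=len(data))
except (TypeError, ValueError, RuntimeError) as e:
    # Log the exception type and the offending keys to aid debugging
    # (the exception types above are assumptions, not exhaustive).
    logger.warning(
        "TensorDict creation failed (%s: %s); keys=%s, batch_size=%d; trying fallback",
        type(e).__name__, e, list(tensor_dict.keys()), len(data),
    )
    # ... existing fallback path ...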

Comment on lines 280 to 291


def dataproto_batchmeta_conversion_v2(func: Optional[Callable] = None, *, transfer_queue_client: Optional[AsyncTransferQueueClient] = None):
    """
    Alternative decorator syntax that supports both @decorator and @decorator() usage.
    """
    def decorator(f: Callable) -> Callable:
        return dataproto_batchmeta_conversion(transfer_queue_client)(f)

    if func is not None:
        return decorator(func)
    return decorator

Copilot AI Sep 29, 2025

The _v2 function appears to be unused and provides the same functionality as the main decorator. Consider removing this duplicate implementation to reduce code complexity.

Suggested change
-def dataproto_batchmeta_conversion_v2(func: Optional[Callable] = None, *, transfer_queue_client: Optional[AsyncTransferQueueClient] = None):
-    """
-    Alternative decorator syntax that supports both @decorator and @decorator() usage.
-    """
-    def decorator(f: Callable) -> Callable:
-        return dataproto_batchmeta_conversion(transfer_queue_client)(f)
-    if func is not None:
-        return decorator(func)
-    return decorator

Signed-off-by: 0oshowero0 <[email protected]>
0oshowero0 requested a review from Copilot September 29, 2025 08:58

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.


Comment on lines 77 to 78
    # Call function with DataProto
    result_data = await func(data, *other_args, **other_kwargs)

Copilot AI Sep 29, 2025

The function func is being awaited but there's no guarantee it's a coroutine. This will fail if the wrapped function is synchronous. The check on line 101 should determine which wrapper to use, but this async wrapper shouldn't call non-async functions with await.

Comment on lines 132 to 134
        # We're in a running loop, use run_coroutine_threadsafe
        future = asyncio.run_coroutine_threadsafe(client.async_get_data(batch_meta), loop)
        data_dict = future.result(timeout=10)  # 10 second timeout

Copilot AI Sep 29, 2025

Using asyncio.run_coroutine_threadsafe with the current running loop will likely cause deadlock. When there's already a running event loop, you should use await instead of trying to run the coroutine in the same loop from a different thread context.

Suggested change
-        # We're in a running loop, use run_coroutine_threadsafe
-        future = asyncio.run_coroutine_threadsafe(client.async_get_data(batch_meta), loop)
-        data_dict = future.result(timeout=10)  # 10 second timeout
+        # We're in a running event loop in this thread; cannot safely run coroutine synchronously.
+        raise RuntimeError(
+            "Cannot call _batchmeta_to_dataproto_sync when an event loop is running in this thread. "
+            "Use the async version (_batchmeta_to_dataproto_async) instead."
+        )

Comment on lines 197 to 205
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No running loop, we can use asyncio.run
        asyncio.run(client.async_put(data=output_tensor_dict, metadata=batch_meta))
    else:
        # We're in a running loop, use run_coroutine_threadsafe
        future = asyncio.run_coroutine_threadsafe(client.async_put(data=output_tensor_dict, metadata=batch_meta), loop)
        future.result(timeout=10)  # 10 second timeout


Copilot AI Sep 29, 2025

Same issue as with async_get_data - using asyncio.run_coroutine_threadsafe with the current running loop will likely cause deadlock. This pattern is problematic when already inside an event loop.

Suggested change
-        loop = asyncio.get_running_loop()
-    except RuntimeError:
-        # No running loop, we can use asyncio.run
-        asyncio.run(client.async_put(data=output_tensor_dict, metadata=batch_meta))
-    else:
-        # We're in a running loop, use run_coroutine_threadsafe
-        future = asyncio.run_coroutine_threadsafe(client.async_put(data=output_tensor_dict, metadata=batch_meta), loop)
-        future.result(timeout=10)  # 10 second timeout
+        asyncio.get_running_loop()
+    except RuntimeError:
+        # No running loop, we can use asyncio.run
+        asyncio.run(client.async_put(data=output_tensor_dict, metadata=batch_meta))
+    else:
+        # We're in a running event loop in this thread; cannot safely run async code synchronously.
+        raise RuntimeError(
+            "Cannot call _update_batchmeta_with_result_sync while an event loop is running in this thread. "
+            "Use _update_batchmeta_with_result_async instead."
+        )

Comment on lines 141 to 150
    for field_name in batch_meta.field_names:
        if field_name == "input_ids":
            data_dict[field_name] = torch.randint(0, 1000, (batch_size, 10))
        elif field_name == "attention_mask":
            data_dict[field_name] = torch.ones(batch_size, 10)
        elif field_name == "responses":
            data_dict[field_name] = torch.randint(0, 1000, (batch_size, 5))
        else:
            # Generic mock data
            data_dict[field_name] = torch.ones(batch_size, 5)

Copilot AI Sep 29, 2025

The mock data generation logic is duplicated between sync and async versions (lines 141-150 and 170-179). This should be extracted into a separate helper function to avoid code duplication.
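
One possible extraction, sketched here with a hypothetical helper name:

import torch

def _mock_field(field_name: str, batch_size: int) -> torch.Tensor:
    # Shared mock-data logic for the sync and async wrappers (sketch;
    # the helper name is hypothetical).
    if field_name == "input_ids":
        return torch.randint(0, 1000, (batch_size, 10))
    if field_name == "attention_mask":
        return torch.ones(batch_size, 10)
    if field_name == "responses":
        return torch.randint(0, 1000, (batch_size, 5))
    return torch.ones(batch_size, 5)  # generic mock data

# Both wrappers would then reduce to:
# data_dict = {name: _mock_field(name, batch_size) for name in batch_meta.field_names}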

Comment on lines 237 to 244
    # Test with client in a separate thread to avoid event loop issues
    print("\n2. Testing compute_response_mask decorator with client...")
    try:
        # Run in a separate thread to avoid event loop conflicts
        import concurrent.futures
        with concurrent.futures.ThreadPoolExecutor() as executor:
            future = executor.submit(compute_response_mask_decorated, batch_meta, transfer_queue_client=mock_client)
            result_batch_meta = future.result(timeout=10)

Copilot AI Sep 29, 2025

Using ThreadPoolExecutor to avoid event loop issues indicates a design problem with the decorator. The decorator should handle async/sync contexts properly without requiring thread workarounds in tests.

Suggested change
-    # Test with client in a separate thread to avoid event loop issues
-    print("\n2. Testing compute_response_mask decorator with client...")
-    try:
-        # Run in a separate thread to avoid event loop conflicts
-        import concurrent.futures
-        with concurrent.futures.ThreadPoolExecutor() as executor:
-            future = executor.submit(compute_response_mask_decorated, batch_meta, transfer_queue_client=mock_client)
-            result_batch_meta = future.result(timeout=10)
+    # Test with client, handle async/sync context properly
+    print("\n2. Testing compute_response_mask decorator with client...")
+    try:
+        result = compute_response_mask_decorated(batch_meta, transfer_queue_client=mock_client)
+        if asyncio.iscoroutine(result):
+            result_batch_meta = await result
+        else:
+            result_batch_meta = result

    else:
        # We're in a running loop, use run_coroutine_threadsafe
        future = asyncio.run_coroutine_threadsafe(client.async_get_data(batch_meta), loop)
        data_dict = future.result(timeout=10)  # 10 second timeout

Copilot AI Sep 29, 2025

The 10-second timeout is a magic number that appears multiple times (lines 134, 204). This should be defined as a constant at the module level for better maintainability.
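
For example (the revision reviewed below does adopt a constant named DEFAULT_ASYNC_TIMEOUT):

# Sketch: hoist the repeated literal into a module-level constant.
DEFAULT_ASYNC_TIMEOUT = 10  # seconds

data_dict = future.result(timeout=DEFAULT_ASYNC_TIMEOUT)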

Signed-off-by: 0oshowero0 <[email protected]>
0oshowero0 requested a review from Copilot September 29, 2025 09:30

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.


if "no running event loop" in str(e):
# No running loop, we can use asyncio.run
asyncio.run(
client.async_put(data=output_tensor_dict, metadata=batch_meta), timeout=DEFAULT_ASYNC_TIMEOUT

Copilot AI Sep 29, 2025

The asyncio.run() function doesn't accept a timeout parameter. The timeout parameter belongs to asyncio.wait_for(). This should be wrapped in asyncio.wait_for() similar to the async version.

Suggested change
-            client.async_put(data=output_tensor_dict, metadata=batch_meta), timeout=DEFAULT_ASYNC_TIMEOUT
+            asyncio.wait_for(
+                client.async_put(data=output_tensor_dict, metadata=batch_meta),
+                timeout=DEFAULT_ASYNC_TIMEOUT
+            )

    try:
        return TensorDict(**tensor_dict, batch_size=len(data))
    except Exception as e:
        logger.warning(f"TensorDict creation failed: {e}, trying fallback")

Copilot AI Sep 29, 2025

The variable logger is not defined in this scope. The logger should be imported at the module level or the function should use print() for error output.

Suggested change
-        logger.warning(f"TensorDict creation failed: {e}, trying fallback")
+        print(f"TensorDict creation failed: {e}, trying fallback")

0oshowero0 merged commit eb31070 into main_tq_submodule Sep 29, 2025
3 of 4 checks passed