[trainer] feat: Upstream Dynamic Sampling #2988
Conversation
Code Review
This pull request introduces dynamic sampling from DAPO into the main trainer to improve sample efficiency and training robustness. The core changes involve filtering rollout samples based on reward variance and backfilling mini-batches. A key addition is a new metric for pre-filter rewards to provide more accurate logging, addressing a noted issue with critic-reward logging. The changes are primarily in verl/trainer/ppo/ray_trainer.py, with corresponding updates to configuration files and metric utilities. My review identified a critical bug in the _extract_reward_extra_infos helper function within ray_trainer.py that would cause a runtime error. The rest of the implementation for dynamic sampling appears to be logically sound and aligns with the goals outlined in the description.
verl/trainer/ppo/ray_trainer.py
Outdated
if key not in standard_keys and isinstance((list, np.ndarray), values):
    reward_extra_infos_dict[key] = values.tolist() if isinstance(np.ndarray, values) else values
The arguments to isinstance are swapped in two places on these lines. The correct usage is isinstance(object, classinfo). The current code will raise a TypeError at runtime when this function is called.
Suggested change:
if key not in standard_keys and isinstance(values, (list, np.ndarray)):
    reward_extra_infos_dict[key] = values.tolist() if isinstance(values, np.ndarray) else values
This comment conflicts with pre-commit, so ignoring it.
Great! Having dapo retool as the baseline
Rebase with main to fix this.
https://github.com/volcengine/verl/actions/runs/17026023726/job/48262007255#step:6:234 Register your parameters.
Co-authored-by: Shawn/Yuxuan Tong <[email protected]>
@gemini-code-assist /review
Code Review
This pull request introduces a dynamic sampling feature, a significant enhancement for improving training efficiency and robustness by filtering and backfilling data batches. The implementation is well-structured, introducing an extensible DynamicFilter class. The configuration changes and new metrics for pre-filter rewards are logical and well-documented. However, I've identified a critical issue in the batch backfilling logic that could lead to incorrect batch sizes, and another high-severity issue regarding the use of zip that could mask potential data inconsistencies. Addressing these points will ensure the feature is robust and reliable.
@gemini-code-assist /review
Code Review
This pull request introduces a dynamic sampling feature to improve training efficiency and robustness. The implementation is well-structured, featuring a modular DynamicFilter class and a clear separation of concerns for metrics and configuration. My review identifies a critical documentation issue regarding feature compatibility and a high-severity bug in the backfill limit logic. Addressing these points will enhance the feature's usability and prevent unexpected runtime failures for users.
resolved
What does this PR do?
Overview
We propose to upstream the Dynamic Sampling (also known as Dynamic Filtering) feature implemented in DAPO to the verl main branch. This feature significantly improves sample efficiency and training robustness through intelligent batch filtering and backfilling strategies.
What is Dynamic Sampling?
Dynamic Sampling is an advanced training strategy that addresses training instability in reinforcement learning by implementing intelligent sample filtering and batch construction:
Core Strategy
Benefits
Implementation Details
Configuration Structure
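The code block with the exact YAML keys did not survive the copy. The sketch below is an illustrative assumption of the kind of structure intended (field names such as enable, metric, max_num_gen_batches, and filter_function are placeholders, not the final schema), expressed with OmegaConf since verl configs are OmegaConf/Hydra based:

```python
from omegaconf import OmegaConf

# Illustrative sketch only: these field names are assumptions, not the keys
# actually introduced by this PR.
dynamic_filter_cfg = OmegaConf.create(
    {
        "enable": True,              # turn dynamic sampling on or off
        "metric": "acc",             # reward/score used to judge within-group variance
        "max_num_gen_batches": 10,   # cap on extra generation rounds used for backfilling
        "filter_function": None,     # optional dotted path to a custom filter function
    }
)
print(OmegaConf.to_yaml(dynamic_filter_cfg))
```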
Key Components
1. Dynamic Filtering Logic
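The original snippet is not included in this excerpt; below is a minimal, hypothetical sketch of the DAPO-style rule described above: a prompt group is kept only if its rollout rewards are not all identical (non-zero variance), so the advantage signal is informative.

```python
import numpy as np

def keep_group(rewards: np.ndarray) -> bool:
    # DAPO-style rule: a group is useful only if its rollouts disagree.
    return bool(np.std(rewards) > 0)

def filter_by_reward_variance(rewards: np.ndarray, group_ids: np.ndarray) -> np.ndarray:
    """Return a per-sample boolean mask keeping samples whose prompt group
    has non-zero reward variance."""
    keep_mask = np.zeros_like(group_ids, dtype=bool)
    for gid in np.unique(group_ids):
        idx = group_ids == gid
        keep_mask[idx] = keep_group(rewards[idx])
    return keep_mask
```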
2. Batch Accumulation and Backfilling
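Again only a sketch (gen_next_batch, compute_rewards, and apply_filter are placeholder callables, not verl APIs): filtered samples are accumulated across additional generation rounds until the configured train batch size is reached, then the pool is trimmed to exactly that size.

```python
def collect_train_batch(gen_next_batch, compute_rewards, apply_filter,
                        train_batch_size, max_num_gen_batches):
    """Accumulate-and-backfill sketch: keep generating and filtering until
    enough informative samples survive, then cut to the exact batch size."""
    kept = []
    for _ in range(max_num_gen_batches):
        batch = gen_next_batch()                   # one rollout generation batch
        rewards = compute_rewards(batch)           # per-sample rewards
        kept.extend(apply_filter(batch, rewards))  # keep only informative samples
        if len(kept) >= train_batch_size:
            return kept[:train_batch_size]         # exact size expected by the trainer
    raise RuntimeError("Ran out of generation budget before filling the batch")
```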
3. Intelligent Sample Selection
Log Bias
Problem: Critic-Reward Logging Bias
When dynamic sampling is enabled, the logged critic-reward is based on the filtered training batch, which may discard high-reward samples during the filtering process. This results in artificially low recorded critic-reward values that don't accurately reflect the true sample quality.
Solution: Dual Logging Strategy
We've implemented a comprehensive logging approach that captures both perspectives:
1. Pre-Filter Metrics (Raw Sample Quality)
2. Post-Filter Metrics (Training Batch Quality)
This dual approach provides an unbiased view of raw sample quality together with an accurate picture of the batch the model is actually trained on.
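As a small illustration of the dual logging strategy (metric names here are placeholders rather than the exact keys added by this PR): the pre-filter mean is taken over the full rollout batch before any sample is dropped, while the post-filter mean covers only the batch that is actually trained on.

```python
import numpy as np

def dual_reward_metrics(all_rewards: np.ndarray, keep_mask: np.ndarray) -> dict:
    return {
        # unbiased view of raw sample quality, computed before dynamic filtering
        "critic/pre_filter/reward_mean": float(all_rewards.mean()),
        # quality of the batch the trainer actually optimizes on
        "critic/post_filter/reward_mean": float(all_rewards[keep_mask].mean()),
    }
```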
Extensible Filter Architecture
To better serve different datasets' reward patterns, we have separated the filtering logic into a modular DynamicFilterManager class that allows users to customize their own filter functions.
Modular Design
The DynamicFilterManager class provides a clean interface for loading custom filter functions:
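The class body was not carried over here; the following is a rough sketch, under the assumption that custom filters are addressed by a dotted import path and that the built-in default is the DAPO-style variance filter sketched earlier:

```python
import importlib
import numpy as np

class DynamicFilterManager:
    """Sketch: resolve a user-specified filter function from a dotted path,
    falling back to a DAPO-style mixed-rewards (variance) filter."""

    def __init__(self, filter_path=None):
        if filter_path is None:
            # default: keep samples whose prompt group has non-zero reward variance
            self.filter_fn = lambda rewards, group_ids: np.array(
                [np.std(rewards[group_ids == g]) > 0 for g in group_ids]
            )
        else:
            module_name, fn_name = filter_path.rsplit(".", 1)
            self.filter_fn = getattr(importlib.import_module(module_name), fn_name)

    def __call__(self, rewards, group_ids):
        return self.filter_fn(rewards, group_ids)
```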
Custom Filter Interface
Users can implement dataset-specific filter functions following a simple signature:
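The concrete signature is not shown in this excerpt; a plausible sketch matching the manager above (function and module names are invented for illustration) is a callable that takes per-sample rewards and group ids and returns a boolean keep mask:

```python
import numpy as np

def code_gen_correctness_filter(rewards: np.ndarray, group_ids: np.ndarray) -> np.ndarray:
    """Hypothetical dataset-specific filter: keep groups that contain at least
    one passing and one failing rollout, so correctness varies within the group."""
    keep_mask = np.zeros_like(group_ids, dtype=bool)
    for gid in np.unique(group_ids):
        idx = group_ids == gid
        group = rewards[idx]
        keep_mask[idx] = bool(group.max() > 0) and bool(group.min() <= 0)
    return keep_mask
```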
Configuration Examples
Default Mixed Rewards Filter (Original DAPO):
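The example block was lost in the copy; an assumed sketch consistent with the structure above, where leaving filter_function unset falls back to the DAPO-style mixed-rewards filter:

```python
# Illustrative only: key names are assumptions, not the final config schema.
default_filter_cfg = {"enable": True, "metric": "acc", "filter_function": None}
```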
Custom Filter for Specific Datasets:
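Likewise an assumed sketch: a custom setup would point filter_function at a user-defined callable, such as the hypothetical code-generation filter sketched earlier.

```python
# Illustrative only: the dotted path refers to the hypothetical filter above.
custom_filter_cfg = {
    "enable": True,
    "metric": "score",
    "filter_function": "my_project.filters.code_gen_correctness_filter",
}
```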
This modular architecture enables easy adaptation to different datasets' unique reward patterns - from mathematical reasoning tasks requiring solution diversity, to code generation needing correctness variation, to creative tasks demanding quality spread. Users can implement domain-specific filtering strategies without modifying the core Dynamic Sampling infrastructure.
Experimental Results
We tested this feature with the DAPO task and observed significant improvements:
Future Work
In the DAPO recipe, oversampling is available to generate a customizable rollout batch size, avoiding backfill overhead. Our current implementation does not have this mechanism.
To achieve that, we might be able to integrate the oversample PR into dynamic sampling.
Checklist Before Starting
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
{modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
If this PR involves multiple modules, separate them with commas, like [megatron, fsdp, doc]
{type} is in feat, fix, refactor, chore, test
If this PR breaks any API, add [BREAKING] to the beginning of the title, like [BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)