
Conversation

Contributor

@Hecate0821 commented Aug 9, 2025

What does this PR do?

Overview

We propose to upstream the Dynamic Sampling feature (also known as Dynamic Fair Turing) implemented in DAPO to the verl main branch. This feature significantly improves sample efficiency and training robustness through intelligent batch filtering and backfilling strategies.

What is Dynamic Sampling?

Dynamic Sampling is an advanced training strategy that addresses training instability in reinforcement learning by implementing intelligent sample filtering and batch construction:

Core Strategy

  1. Filter out uninformative samples - Remove prompts whose rollouts are all correct (accuracy == 1) or all wrong (accuracy == 0), since they carry no learning signal
  2. Keep only samples with non-zero gradients - Preserve samples that contribute to learning
  3. Backfill the batch - Continue generating until the full mini-batch contains valid, informative samples

Benefits

  • Reduces zero-gradient sample proportion - Eliminates samples that don't contribute to learning
  • Increases gradient signal density - Each batch contains more informative samples
  • Prevents early saturation - Avoids entropy collapse by maintaining diverse, learnable samples
  • Improves training stability - More consistent learning signals across batches

Implementation Details

Configuration Structure

from dataclasses import dataclass
from typing import Optional

from verl.base_config import BaseConfig

@dataclass
class FilterGroupsConfig(BaseConfig):
    enable: bool = False
    metric: Optional[str] = None  # e.g. "seq_reward"
    max_num_gen_batches: int = 0  # Max backfill attempts (0 = unlimited)
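
As a quick illustration, a hypothetical instantiation of this config (the values below are examples, not the PR's defaults):

# Hypothetical example values: enable dynamic sampling, filter on per-sequence reward,
# and allow at most 10 backfill generation rounds per training step.
filter_cfg = FilterGroupsConfig(enable=True, metric="seq_reward", max_num_gen_batches=10)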

Key Components

1. Dynamic Filtering Logic

import numpy as np

# metric_vals holds the chosen metric (e.g. seq_reward) for every rollout of one prompt.
# Filter out prompts whose rollouts are all positive OR all <= 0.
all_positive = np.all(metric_vals > 0)
all_non_positive = np.all(metric_vals <= 0)
# Keep the prompt only if it has both positive and non-positive values
# (single-rollout prompts are always kept)
should_keep = not (all_positive or all_non_positive) or len(metric_vals) == 1
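
As a small illustration (the reward values below are made up), this rule keeps only prompts whose rollouts mix positive and non-positive rewards:

import numpy as np

# Illustrative per-prompt reward groups (one array per prompt, several rollouts each)
groups = {
    "prompt_a": np.array([1.0, 1.0, 1.0, 1.0]),  # all positive -> filtered out
    "prompt_b": np.array([0.0, 0.0, 0.0, 0.0]),  # all non-positive -> filtered out
    "prompt_c": np.array([1.0, 0.0, 1.0, 0.0]),  # mixed rewards -> kept
}

for uid, metric_vals in groups.items():
    all_positive = np.all(metric_vals > 0)
    all_non_positive = np.all(metric_vals <= 0)
    should_keep = not (all_positive or all_non_positive) or len(metric_vals) == 1
    print(uid, "keep" if should_keep else "drop")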

2. Batch Accumulation and Backfilling

from verl.protocol import DataProto

# Accumulate valid (informative) samples across multiple generation batches
accumulated_batch = (
    batch if accumulated_batch is None
    else DataProto.concat([accumulated_batch, batch])
)

# Keep generating until the target number of prompts is reached,
# unless the backfill budget (max_num_gen_batches) has been exhausted
if num_prompt_in_batch < prompt_bsz and num_gen_batches < max_num_gen_batches:
    continue  # Generate more batches
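
Putting the two pieces together, a minimal sketch of the backfill loop (generate_and_score and filter_batch are hypothetical placeholders, not the PR's actual helpers):

# Minimal sketch of the backfill loop; generate_and_score() and filter_batch()
# are hypothetical stand-ins for rollout + reward computation and dynamic filtering.
accumulated_batch, num_prompt_in_batch, num_gen_batches = None, 0, 0

while True:
    num_gen_batches += 1
    batch = generate_and_score()                   # one generation batch with rewards
    batch, num_kept_prompts = filter_batch(batch)  # drop all-correct / all-wrong prompts
    num_prompt_in_batch += num_kept_prompts

    accumulated_batch = (
        batch if accumulated_batch is None
        else DataProto.concat([accumulated_batch, batch])
    )

    if num_prompt_in_batch >= prompt_bsz:
        break  # enough informative prompts collected; trim to prompt_bsz and train
    if max_num_gen_batches and num_gen_batches >= max_num_gen_batches:
        raise RuntimeError("hit max_num_gen_batches before filling the batch")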

3. Intelligent Sample Selection

  • Uses reward variance as the primary filtering metric
  • Discards samples with uniform reward patterns (all high or all low)
  • Preserves samples with mixed reward patterns that provide learning signal

Logging Bias

Problem: Critic-Reward Logging Bias

When dynamic sampling is enabled, the logged critic-reward is based on the filtered training batch, which may discard high-reward samples during the filtering process. This results in artificially low recorded critic-reward values that don't accurately reflect the true sample quality.

Solution: Dual Logging Strategy

We've implemented a comprehensive logging approach that captures both perspectives:

1. Pre-Filter Metrics (Raw Sample Quality)

# Calculate reward statistics over the entire rollout batch BEFORE filtering
reward_metrics = compute_reward_metrics(batch)
metrics.update(reward_metrics)
logger.log(data=metrics, step=self.global_steps)
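
For context, a minimal sketch of the kind of statistics such pre-filter logging can report (an illustration of the idea, not the PR's compute_reward_metrics implementation):

import torch

def prefilter_reward_stats(token_level_scores: torch.Tensor, prefix: str = "train/pre_filter_reward"):
    # Sum token-level scores into one scalar reward per rollout sequence,
    # then report simple statistics over the full, unfiltered rollout batch.
    seq_rewards = token_level_scores.sum(dim=-1)
    return {
        f"{prefix}/mean": seq_rewards.mean().item(),
        f"{prefix}/max": seq_rewards.max().item(),
        f"{prefix}/min": seq_rewards.min().item(),
    }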

2. Post-Filter Metrics (Training Batch Quality)

# Calculate post-filter reward pattern metrics
post_filter_metrics = compute_reward_pattern_metrics(
    batch.non_tensor_batch["uid"],
    batch.batch["token_level_scores"],
    prefix="train/post_filter_reward_pattern",
    include_exact_values=True,
)
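
To make the reward-pattern idea concrete, here is a sketch of per-prompt pattern statistics (illustrative only; the PR's compute_reward_pattern_metrics may differ in detail):

from collections import defaultdict

def reward_pattern_stats(uids, seq_rewards, prefix="train/post_filter_reward_pattern"):
    # Group per-sample rewards by their originating prompt uid, then report the
    # fraction of prompts that are all positive, all non-positive, or mixed.
    by_prompt = defaultdict(list)
    for uid, reward in zip(uids, seq_rewards):
        by_prompt[uid].append(reward)

    counts = {"all_positive": 0, "all_non_positive": 0, "mixed": 0}
    for vals in by_prompt.values():
        if all(v > 0 for v in vals):
            counts["all_positive"] += 1
        elif all(v <= 0 for v in vals):
            counts["all_non_positive"] += 1
        else:
            counts["mixed"] += 1

    num_prompts = max(len(by_prompt), 1)
    return {f"{prefix}/{name}_frac": count / num_prompts for name, count in counts.items()}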

This dual approach provides:

  • Accurate sample quality assessment - Pre-filter metrics show true rollout performance
  • Training effectiveness monitoring - Post-filter metrics show actual training batch quality
  • Complete visibility - Engineers can understand both the raw data quality and filtered training effectiveness

Extensible Filter Architecture

To better serve different datasets' reward patterns, we have separated the filtering logic into a modular DynamicFilterManager class that allows users to customize their own filter functions.

Modular Design

The DynamicFilterManager class provides a clean interface for loading custom filter functions:

from typing import Optional

class DynamicFilterManager:
    def __init__(self, filter_function: Optional[str] = None, metric: str = "seq_reward", **filter_kwargs):
        # Dynamically imports and applies the user-specified filter function
        ...
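
A minimal sketch of how such dotted-path loading could work, assuming a "module.path.function_name" convention (illustrative, not the exact implementation):

import importlib
from typing import Callable

def load_filter_function(dotted_path: str) -> Callable[..., bool]:
    # Split e.g. "my_module.my_custom_filter" into module path and attribute name,
    # import the module, and return the filter callable.
    module_path, func_name = dotted_path.rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, func_name)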

Custom Filter Interface

Users can implement dataset-specific filter functions following a simple signature:

from typing import List, Union

def custom_filter_function(metric_vals: List[Union[float, int]], **kwargs) -> bool:
    """
    Args:
        metric_vals: List of metric values for samples from the same prompt
        **kwargs: Additional configuration parameters from filter_kwargs

    Returns:
        bool: True if the prompt should be kept, False if it should be filtered out
    """
    # Custom filtering logic here
    should_keep = True  # placeholder decision
    return should_keep
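
For example, a hypothetical variance-based filter that consumes the threshold and min_variance kwargs used in the configuration example further below:

from typing import List, Union

import numpy as np

def variance_filter(metric_vals: List[Union[float, int]], threshold: float = 0.5, min_variance: float = 0.2, **kwargs) -> bool:
    # Illustrative only: keep a prompt if its mean reward stays below a threshold
    # and its rewards vary enough to carry a learning signal.
    vals = np.asarray(metric_vals, dtype=float)
    if len(vals) == 1:
        return True
    return bool(vals.mean() < threshold and vals.var() >= min_variance)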

Configuration Examples

Default Mixed Rewards Filter (Original DAPO):

algorithm:
  dynamic_filter:
    enable: true
    filter_function: "verl.trainer.ppo.dynamic_filtering.keep_mixed_reward"

Custom Filter for Specific Datasets:

algorithm:
  dynamic_filter:
    enable: true
    filter_function: "my_module.my_custom_filter"
    filter_kwargs:
      threshold: 0.5
      min_variance: 0.2

This modular architecture enables easy adaptation to different datasets' unique reward patterns - from mathematical reasoning tasks requiring solution diversity, to code generation needing correctness variation, to creative tasks demanding quality spread. Users can implement domain-specific filtering strategies without modifying the core Dynamic Sampling infrastructure.
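
End to end, wiring a custom filter into the manager might look like this (assuming the DynamicFilterManager interface sketched above; illustrative only):

# Illustrative: load a user-defined filter and pass dataset-specific kwargs.
manager = DynamicFilterManager(
    filter_function="my_module.my_custom_filter",
    metric="seq_reward",
    threshold=0.5,
    min_variance=0.2,
)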

Experimental Results

We tested this feature with the DAPO task and observed significant improvements:

  • Green line: Without Dynamic Fair Turing (baseline)
  • Yellow line: With Dynamic Fair Turing enabled

Future Work

In the DAPO recipe, oversampling is available to generate a customizable rollout batch size, avoiding backfill overhead. Our current implementation does not have this mechanism.

To support that, we may be able to integrate the oversampling PR into dynamic sampling.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@CLAassistant

CLAassistant commented Aug 9, 2025

CLA assistant check
All committers have signed the CLA.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces dynamic sampling from DAPO into the main trainer to improve sample efficiency and training robustness. The core changes involve filtering rollout samples based on reward variance and backfilling mini-batches. A key addition is a new metric for pre-filtering rewards to provide more accurate logging, addressing a noted issue with critic-reward logging. The changes are primarily in verl/trainer/ppo/ray_trainer.py, with corresponding updates to configuration files and metric utilities. My review identified a critical bug in the _extract_reward_extra_infos helper function within ray_trainer.py that would cause a runtime error. The rest of the implementation for dynamic sampling appears to be logically sound and aligns with the goals outlined in the description.

Comment on lines 599 to 600
if key not in standard_keys and isinstance((list, np.ndarray), values):
reward_extra_infos_dict[key] = values.tolist() if isinstance(np.ndarray, values) else values
Contributor


critical

The arguments to isinstance are swapped in two places on these lines. The correct usage is isinstance(object, classinfo). The current code will raise a TypeError at runtime when this function is called.

Suggested change

- if key not in standard_keys and isinstance((list, np.ndarray), values):
- reward_extra_infos_dict[key] = values.tolist() if isinstance(np.ndarray, values) else values
+ if key not in standard_keys and isinstance(values, (list, np.ndarray)):
+ reward_extra_infos_dict[key] = values.tolist() if isinstance(values, np.ndarray) else values

Contributor Author


This suggestion conflicts with pre-commit; ignoring it.

@Hecate0821 changed the title from Ds nokl to [trainer] feat: Upstream Dynamic Sampling on Aug 9, 2025
@zhaochenyang20
Collaborator

Great! Having dapo retool as the baseline

@Hecate0821
Contributor Author

@gemini-code-assist /review

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a dynamic sampling feature, a significant enhancement for improving training efficiency and robustness by filtering and backfilling data batches. The implementation is well-structured, introducing an extensible DynamicFilter class. The configuration changes and new metrics for pre-filter rewards are logical and well-documented. However, I've identified a critical issue in the batch backfilling logic that could lead to incorrect batch sizes, and another high-severity issue regarding the use of zip that could mask potential data inconsistencies. Addressing these points will ensure the feature is robust and reliable.

@Hecate0821
Contributor Author

@gemini-code-assist /review

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a dynamic sampling feature to improve training efficiency and robustness. The implementation is well-structured, featuring a modular DynamicFilter class and a clear separation of concerns for metrics and configuration. My review identifies a critical documentation issue regarding feature compatibility and a high-severity bug in the backfill limit logic. Addressing these points will enhance the feature's usability and prevent unexpected runtime failures for users.

Contributor Author

@Hecate0821 left a comment


resolved
