
Conversation


@jianfeng-Liu jianfeng-Liu commented Sep 15, 2025

What does this PR do?

This feature mainly implements Group Filtering Policy Optimization (GFPO) to reduce response length inflation in RL. For more details, see https://arxiv.org/pdf/2508.09726 and #3449.
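
For readers new to GFPO: the core idea (from the paper, and mirrored by the filtering/sampling logic added in this PR) is that, for each prompt, a larger group of responses is sampled, ranked by a filtering metric such as response length or token efficiency (reward per generated token), and only the top `retain_count` responses per group are kept for training, so the policy is not rewarded for verbose rollouts. Below is a minimal, illustrative sketch of that group-filtering step; the function and argument names are hypothetical and do not reflect the PR's actual API:

```python
from collections import defaultdict

def gfpo_keep_indices(prompt_ids, response_lengths, rewards,
                      retain_count=8, metric="response length"):
    """Illustrative GFPO-style group filtering: return indices of responses to keep."""
    groups = defaultdict(list)
    for i, pid in enumerate(prompt_ids):
        if metric == "response length":
            score = response_lengths[i]  # shorter responses rank first
        elif metric == "token efficiency":
            # higher reward per generated token ranks first (negated so lower is better)
            score = -rewards[i] / (response_lengths[i] + 1e-8)
        else:
            raise NotImplementedError(f"metric {metric} not supported")
        groups[pid].append((i, score))

    kept = []
    for scored in groups.values():
        scored.sort(key=lambda pair: pair[1])  # ascending: best candidates first
        kept.extend(i for i, _ in scored[:retain_count])
    return sorted(kept)

# Example: two prompts with four rollouts each, keeping the two shortest per prompt.
kept = gfpo_keep_indices(
    prompt_ids=[0, 0, 0, 0, 1, 1, 1, 1],
    response_lengths=[120, 950, 300, 80, 400, 60, 700, 650],
    rewards=[1, 1, 0, 1, 0, 1, 1, 0],
    retain_count=2,
)
print(kept)  # -> [0, 3, 4, 5]
```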

Experimental Results

By running the script run_dapo_qwen_3_4b_w_GFPO.sh on a platform with 16 NPUs (A3), we obtained the following results. The DAPO baseline uses essentially the same experimental setup as GFPO, except that algorithm.filter_sample.enable=False:

| Metric Name | Step 10 DAPO | Step 20 DAPO | Step 10 GFPO | Step 20 GFPO |
| --- | --- | --- | --- | --- |
| prompt_length/mean | 139.938 | 138.312 | 139.938 | 138.312 |
| prompt_length/max | 216.000 | 237.000 | 216.000 | 237.000 |
| prompt_length/min | 84.000 | 84.000 | 84.000 | 84.000 |
| response_length/mean | 6930.680 | 6901.121 | 6216.094 | 5929.109 |
| response_length/max | 10240.000 | 10240.000 | 10240.000 | 10240.000 |
| response_length/min | 1072.000 | 1452.000 | 960.000 | 1595.000 |
| response_length/clip_ratio | 0.324 | 0.258 | 0.219 | 0.109 |
| val-core/math_dapo/acc/best@32/mean | 0.338 | 0.329 | 0.355 | 0.241 |
| val-core/math_dapo/acc/best@32/std | 0.093 | 0.119 | 0.095 | 0.043 |
| val-core/math_dapo/acc/maj@32/mean | 0.052 | 0.049 | 0.047 | 0.062 |
| val-core/math_dapo/acc/maj@32/std | 0.024 | 0.018 | 0.017 | 0.033 |
| val-core/math_dapo/acc/mean@32 | 0.069 | 0.053 | 0.067 | 0.064 |
| timing_s/generate_sequences | 463.394 | 474.458 | 464.598 | 484.680 |
| timing_s/gen | 646.008 | 647.060 | 647.169 | 642.342 |
| timing_s/update_actor | 41.141 | 40.331 | 11.367 | 10.617 |
| timing_s/testing | 1129.709 | 1131.565 | 1134.280 | 1121.248 |
| timing_s/step | 1827.559 | 1829.380 | 1796.303 | 1777.560 |
• Response Length: DAPO produces ~6,900 tokens on average, 700-1,000 tokens longer than GFPO, resulting in a 1.5-2.4× higher clip ratio and more redundant text.
• Accuracy: best@32 favors DAPO at step 20 (0.329 vs 0.241) but slightly favors GFPO at step 10 (0.338 vs 0.355), while maj@32 and mean@32 are almost identical.
• Training Speed: the actor update takes 40-41 s for DAPO versus 10-11 s for GFPO, a roughly 4× speed-up that saves about 30-50 s per step and scales to dozens of NPU hours over thousands of steps.
• Recommendation: adopt GFPO for continued RL; it delivers comparable math accuracy with lower latency and memory usage.

@CLAassistant

CLAassistant commented Sep 15, 2025

CLA assistant check
All committers have signed the CLA.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements Group Filtering Policy Optimization (GFPO) to reduce response length inflation. The changes introduce new configuration options, core algorithm logic for filtering and sampling, and integration into the DAPO trainer. My review identified several critical issues, including incorrect configuration access, missing parameters in shell scripts, incorrect function arguments, and logical errors in loops that could lead to runtime errors or incorrect behavior. I've also pointed out a potential division-by-zero issue and provided suggestions to fix these problems.

Comment on lines 168 to 181:

```python
if self.config.algorithm.filter_samples.enable:
    new_batch.batch["response_mask"] = compute_response_mask(new_batch)
    filtering_sampling_kept_traj_idxs = filtering_sampling(new_batch,
                                                           metric=self.config.algorithm.filter_sample.metric,
                                                           metric_name="token_level_scores",
                                                           retain_count=self.config.algorithm.filter_sample.retain_count,
                                                           adaptive=self.config.algorithm.filter_sample.adaptive,
                                                           t_digest=t_digest,
                                                           easy_count=self.config.algorithm.filter_sample.easy_count,
                                                           medium_count=self.config.algorithm.filter_sample.medium_count,
                                                           hard_count=self.config.algorithm.filter_sample.hard_count,
                                                           very_hard_count=self.config.algorithm.filter_sample.hard_count,
                                                           )
    new_batch = new_batch[filtering_sampling_kept_traj_idxs]
```
Contributor


critical

There are a couple of issues in this block:

  1. There's a typo in the configuration access on line 168. It should be self.config.algorithm.filter_sample.enable instead of self.config.algorithm.filter_samples.enable. This will cause a runtime error as filter_samples is not defined in the configuration.
  2. On line 179, the very_hard_count parameter is being assigned the value of hard_count from the configuration. It should be using self.config.algorithm.filter_sample.very_hard_count.
Suggested change:

```diff
-if self.config.algorithm.filter_samples.enable:
+if self.config.algorithm.filter_sample.enable:
     new_batch.batch["response_mask"] = compute_response_mask(new_batch)
     filtering_sampling_kept_traj_idxs = filtering_sampling(new_batch,
                                                            metric=self.config.algorithm.filter_sample.metric,
                                                            metric_name="token_level_scores",
                                                            retain_count=self.config.algorithm.filter_sample.retain_count,
                                                            adaptive=self.config.algorithm.filter_sample.adaptive,
                                                            t_digest=t_digest,
                                                            easy_count=self.config.algorithm.filter_sample.easy_count,
                                                            medium_count=self.config.algorithm.filter_sample.medium_count,
                                                            hard_count=self.config.algorithm.filter_sample.hard_count,
-                                                           very_hard_count=self.config.algorithm.filter_sample.hard_count,
+                                                           very_hard_count=self.config.algorithm.filter_sample.very_hard_count,
                                                            )
     new_batch = new_batch[filtering_sampling_kept_traj_idxs]
```

Comment on lines 82 to 88:

```bash
algorithm.filter_sample.enable=True\
algorithm.filter_sample.metric="response length"\
algorithm.filter_sample.retain_count=8\
algorithm.filter_sample.adaptive=True\
algorithm.filter_sample.easy_count=4\
algorithm.filter_sample.medium_count=6\
algorithm.filter_sample.hard_count=8\
```
Contributor


critical

The algorithm.filter_sample.very_hard_count parameter is missing. When adaptive=True, the filtering_sampling function requires very_hard_count. If it's not provided, it will default to None and cause a TypeError when used in min(count, len(id_score)), as count could be None.

Suggested change:

```diff
 algorithm.filter_sample.enable=True\
 algorithm.filter_sample.metric="response length"\
 algorithm.filter_sample.retain_count=8\
 algorithm.filter_sample.adaptive=True\
 algorithm.filter_sample.easy_count=4\
 algorithm.filter_sample.medium_count=6\
 algorithm.filter_sample.hard_count=8\
+algorithm.filter_sample.very_hard_count=8\
```

```python
    Returns:
        kept_traj_idxs: the desirable responses to train on.
    """
    id2response_and_score, id2average_reward = compute_scores(data, metric, adaptive, metric_name)
```
Contributor


critical

The arguments to compute_scores are passed positionally, but the order is incorrect. adaptive is passed as metric_name and metric_name is passed as adaptive. This will likely cause a KeyError when trying to access data.batch[adaptive] inside compute_scores. You should use keyword arguments to avoid this kind of error.

Suggested change:

```diff
-    id2response_and_score, id2average_reward = compute_scores(data, metric, adaptive, metric_name)
+    id2response_and_score, id2average_reward = compute_scores(data, metric=metric, metric_name=metric_name, adaptive=adaptive)
```

```python
    else:
        for id in id2response_and_score.keys():
            id_score = id2response_and_score[id]
            for i in range(min(retain_count, len(id_score[i]))):
```
Contributor


critical

There is a bug in the loop condition. id_score is a list of tuples, so id_score[i] is a tuple and len(id_score[i]) will always be 2. The intention is likely to iterate up to retain_count or the number of available scores for that id, which is len(id_score).

Suggested change:

```diff
-            for i in range(min(retain_count, len(id_score[i]))):
+            for i in range(min(retain_count, len(id_score))):
```

```python
            id2response_and_score[index[i]].append((i, response_length[i]))
    elif metric == "token efficiency":
        for i in range(bsz):
            id2response_and_score[index[i]].append((i, -reward_value[i] / response_length[i]))
```
Contributor


high

Division by response_length[i] can lead to a division-by-zero error if a response is empty (i.e., response_length[i] is 0). It's safer to add a small epsilon to the denominator to prevent this.

Suggested change:

```diff
-            id2response_and_score[index[i]].append((i, -reward_value[i] / response_length[i]))
+            id2response_and_score[index[i]].append((i, -reward_value[i] / (response_length[i] + 1e-8)))
```

```bash
algorithm.filter_groups.enable=${enable_filter_groups} \
algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
algorithm.filter_groups.metric=${filter_groups_metric} \
algorithm.filter_sample.enable=True\
```
Collaborator


Adding default parameters for the group filtering policy to the existing DAPO scripts is not recommended; how about creating a new file?

Author

@jianfeng-Liu jianfeng-Liu Sep 15, 2025


Okay, I have created two new files: run_dapo_qwen2.5_32b_w_GFPO.sh and run_dapo_qwen_3_4b_w_GFPO.sh.

```python
        for i in range(bsz):
            id2response_and_score[index[i]].append((i, -reward_value[i] / (response_length[i] + 10**(-8))))
    else:
        raise NotImplementedError
```
Collaborator


Suggest adding a short explanatory message to this error so that it is easier to understand.
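
For instance (illustrative wording only; the exact message is up to the author, and `metric` is assumed to be the unsupported value in scope), the raise could name the bad metric and the accepted options:

```python
# Illustrative wording only for the suggested error message.
raise NotImplementedError(
    f"filter_sample metric '{metric}' is not supported; "
    "expected 'response length' or 'token efficiency'."
)
```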

```diff
@@ -1 +0,0 @@
-0.5.0.dev
```
Collaborator


Why was this content removed?

@jianfeng-Liu jianfeng-Liu changed the title Implements Group Filtering Policy Optimization [RFC]Implements Group Filtering Policy Optimization Sep 15, 2025
```yaml
# Power used for weight scaling in "pow" method
weight_pow: 2.0

filter_sample:
```
Contributor


Configuration entries in the YAML are usually accompanied by explanatory comments to improve usability. For example, retain_count does not take effect when adaptive=True.
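
For example, an inline-commented version of the block above could look like the following. This is suggested wording only; the values mirror the example script, and the comments reflect my reading of the adaptive mode (difficulty-based per-group budgets), not authoritative documentation:

```yaml
filter_sample:
  enable: True                # turn GFPO group filtering on or off
  metric: "response length"   # or "token efficiency" (reward per generated token)
  retain_count: 8             # responses kept per prompt group; ignored when adaptive=True
  adaptive: True              # choose the per-group budget from the difficulty buckets below
  easy_count: 4
  medium_count: 6
  hard_count: 8
  very_hard_count: 8
```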

```python
        for i in range(bsz):
            id2response_and_score[index[i]].append((i, -reward_value[i] / (response_length[i] + 10**(-8))))
    else:
        raise NotImplementedError(f"metric {metric} not supported")
```
Contributor


The metric dispatch and the loop are redundant and can be merged into a unified processing approach, which also makes it easier to extend with future metrics:

```python
score_func = None
if metric == "response length":
    score_func = lambda i: response_length[i]
elif metric == "token efficiency":
    score_func = lambda i: -reward_value[i] / (response_length[i] + 1e-8)
else:
    raise NotImplementedError(f"metric {metric} not supported")
for i in range(bsz):
    id2response_and_score[index[i]].append((i, score_func(i)))
```

Contributor

@tardis-key tardis-key left a comment


LGTM on code implementation.
To prove the correctness of the implementation, I believe it is necessary to provide some experimental data and documentation.

@jianfeng-Liu
Author

> LGTM on code implementation. To prove the correctness of the implementation, I believe it is necessary to provide some experimental data and documentation.

@jianfeng-Liu jianfeng-Liu deleted the detached branch September 23, 2025 13:53
@jianfeng-Liu jianfeng-Liu restored the detached branch September 23, 2025 13:53
@jianfeng-Liu jianfeng-Liu reopened this Sep 23, 2025
@jianfeng-Liu
Author

> LGTM on code implementation. To prove the correctness of the implementation, I believe it is necessary to provide some experimental data and documentation.

Okay, I have provided some experimental results; more detailed documentation and experimental results can be found in the paper https://arxiv.org/pdf/2508.09726 and #3449.

@tardis-key
Contributor

please address comments from other reviewers

Contributor

@tardis-key tardis-key left a comment


LGTM

@jianfeng-Liu
Author

@FightingZhen @eric-haibin-lin @CLAassistant Hi, are there any other questions?

@jianfeng-Liu
Author

> please address comments from other reviewers

Okay, I have addressed all the comments raised by the other reviewers so far.
