
Conversation


@jianfeng-Liu jianfeng-Liu commented Sep 15, 2025

What does this PR do?

This feature mainly implements Group Filtering Policy Optimization (GFPO) to reduce response length inflation in RL. For more details, see https://arxiv.org/pdf/2508.09726 and #3449.
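
For readers new to GFPO: the core idea (from the paper, and mirrored by the filtering/sampling logic added in this PR) is that, for each prompt, a larger group of responses is sampled, ranked by a filtering metric such as response length or token efficiency (reward per generated token), and only the top `retain_count` responses per group are kept for training, so the policy is not rewarded for verbose rollouts. Below is a minimal, illustrative sketch of that group-filtering step; the function and argument names are hypothetical and do not reflect the PR's actual API:

```python
from collections import defaultdict

def gfpo_keep_indices(prompt_ids, response_lengths, rewards,
                      retain_count=8, metric="response length"):
    """Illustrative GFPO-style group filtering: return indices of responses to keep."""
    groups = defaultdict(list)
    for i, pid in enumerate(prompt_ids):
        if metric == "response length":
            score = response_lengths[i]  # shorter responses rank first
        elif metric == "token efficiency":
            # higher reward per generated token ranks first (negated so lower is better)
            score = -rewards[i] / (response_lengths[i] + 1e-8)
        else:
            raise NotImplementedError(f"metric {metric} not supported")
        groups[pid].append((i, score))

    kept = []
    for scored in groups.values():
        scored.sort(key=lambda pair: pair[1])  # ascending: best candidates first
        kept.extend(i for i, _ in scored[:retain_count])
    return sorted(kept)

# Example: two prompts with four rollouts each, keeping the two shortest per prompt.
kept = gfpo_keep_indices(
    prompt_ids=[0, 0, 0, 0, 1, 1, 1, 1],
    response_lengths=[120, 950, 300, 80, 400, 60, 700, 650],
    rewards=[1, 1, 0, 1, 0, 1, 1, 0],
    retain_count=2,
)
print(kept)  # -> [0, 3, 4, 5]
```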

Experimental Results

By running the script run_dapo_qwen_3_4b_w_GFPO.sh on a platform with 16 NPUs (A3), we obtained the following results. The DAPO baseline uses essentially the same experimental setup as GFPO, except that algorithm.filter_sample.enable=False:

| Metric Name | Step 10 DAPO | Step 20 DAPO | Step 10 GFPO | Step 20 GFPO |
| --- | --- | --- | --- | --- |
| prompt_length/mean | 139.938 | 138.312 | 139.938 | 138.312 |
| prompt_length/max | 216.000 | 237.000 | 216.000 | 237.000 |
| prompt_length/min | 84.000 | 84.000 | 84.000 | 84.000 |
| response_length/mean | 6930.680 | 6901.121 | 6216.094 | 5929.109 |
| response_length/max | 10240.000 | 10240.000 | 10240.000 | 10240.000 |
| response_length/min | 1072.000 | 1452.000 | 960.000 | 1595.000 |
| response_length/clip_ratio | 0.324 | 0.258 | 0.219 | 0.109 |
| val-core/math_dapo/acc/best@32/mean | 0.338 | 0.329 | 0.355 | 0.241 |
| val-core/math_dapo/acc/best@32/std | 0.093 | 0.119 | 0.095 | 0.043 |
| val-core/math_dapo/acc/maj@32/mean | 0.052 | 0.049 | 0.047 | 0.062 |
| val-core/math_dapo/acc/maj@32/std | 0.024 | 0.018 | 0.017 | 0.033 |
| val-core/math_dapo/acc/mean@32 | 0.069 | 0.053 | 0.067 | 0.064 |
| timing_s/generate_sequences | 463.394 | 474.458 | 464.598 | 484.680 |
| timing_s/gen | 646.008 | 647.060 | 647.169 | 642.342 |
| timing_s/update_actor | 41.141 | 40.331 | 11.367 | 10.617 |
| timing_s/testing | 1129.709 | 1131.565 | 1134.280 | 1121.248 |
| timing_s/step | 1827.559 | 1829.380 | 1796.303 | 1777.560 |
• Response Length: DAPO produces ~6,900 tokens on average, 700-1,000 tokens longer than GFPO, resulting in a 1.5-2.4× higher clip ratio and more redundant text.
• Accuracy: best@32 favors DAPO at step 20 (0.329 vs 0.241) but slightly favors GFPO at step 10 (0.338 vs 0.355), while maj@32 and mean@32 are almost identical.
• Training Speed: the actor update takes 40-41 s for DAPO versus 10-11 s for GFPO, a roughly 4× speed-up that saves about 30-50 s per step and scales to dozens of NPU hours over thousands of steps.
• Recommendation: adopt GFPO for continued RL; it delivers comparable math accuracy with lower latency and memory usage.

@CLAassistant

CLAassistant commented Sep 15, 2025

CLA assistant check
All committers have signed the CLA.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request implements Group Filtering Policy Optimization (GFPO) to reduce response length inflation. The changes introduce new configuration options, core algorithm logic for filtering and sampling, and integration into the DAPO trainer. My review identified several critical issues, including incorrect configuration access, missing parameters in shell scripts, incorrect function arguments, and logical errors in loops that could lead to runtime errors or incorrect behavior. I've also pointed out a potential division-by-zero issue and provided suggestions to fix these problems.

Comment on lines 168 to 181:

```python
if self.config.algorithm.filter_samples.enable:
    new_batch.batch["response_mask"] = compute_response_mask(new_batch)
    filtering_sampling_kept_traj_idxs = filtering_sampling(new_batch,
                                                           metric=self.config.algorithm.filter_sample.metric,
                                                           metric_name="token_level_scores",
                                                           retain_count=self.config.algorithm.filter_sample.retain_count,
                                                           adaptive=self.config.algorithm.filter_sample.adaptive,
                                                           t_digest=t_digest,
                                                           easy_count=self.config.algorithm.filter_sample.easy_count,
                                                           medium_count=self.config.algorithm.filter_sample.medium_count,
                                                           hard_count=self.config.algorithm.filter_sample.hard_count,
                                                           very_hard_count=self.config.algorithm.filter_sample.hard_count,
                                                           )
    new_batch = new_batch[filtering_sampling_kept_traj_idxs]
```
Contributor


critical

There are a couple of issues in this block:

  1. There's a typo in the configuration access on line 168. It should be self.config.algorithm.filter_sample.enable instead of self.config.algorithm.filter_samples.enable. This will cause a runtime error as filter_samples is not defined in the configuration.
  2. On line 179, the very_hard_count parameter is being assigned the value of hard_count from the configuration. It should be using self.config.algorithm.filter_sample.very_hard_count.
Suggested change:

```diff
-if self.config.algorithm.filter_samples.enable:
+if self.config.algorithm.filter_sample.enable:
     new_batch.batch["response_mask"] = compute_response_mask(new_batch)
     filtering_sampling_kept_traj_idxs = filtering_sampling(new_batch,
                                                            metric=self.config.algorithm.filter_sample.metric,
                                                            metric_name="token_level_scores",
                                                            retain_count=self.config.algorithm.filter_sample.retain_count,
                                                            adaptive=self.config.algorithm.filter_sample.adaptive,
                                                            t_digest=t_digest,
                                                            easy_count=self.config.algorithm.filter_sample.easy_count,
                                                            medium_count=self.config.algorithm.filter_sample.medium_count,
                                                            hard_count=self.config.algorithm.filter_sample.hard_count,
-                                                           very_hard_count=self.config.algorithm.filter_sample.hard_count,
+                                                           very_hard_count=self.config.algorithm.filter_sample.very_hard_count,
                                                            )
     new_batch = new_batch[filtering_sampling_kept_traj_idxs]
```

Comment on lines 82 to 88:

```bash
algorithm.filter_sample.enable=True\
algorithm.filter_sample.metric="response length"\
algorithm.filter_sample.retain_count=8\
algorithm.filter_sample.adaptive=True\
algorithm.filter_sample.easy_count=4\
algorithm.filter_sample.medium_count=6\
algorithm.filter_sample.hard_count=8\
```
Contributor


critical

The algorithm.filter_sample.very_hard_count parameter is missing. When adaptive=True, the filtering_sampling function requires very_hard_count. If it's not provided, it will default to None and cause a TypeError when used in min(count, len(id_score)), as count could be None.

Suggested change:

```diff
 algorithm.filter_sample.enable=True\
 algorithm.filter_sample.metric="response length"\
 algorithm.filter_sample.retain_count=8\
 algorithm.filter_sample.adaptive=True\
 algorithm.filter_sample.easy_count=4\
 algorithm.filter_sample.medium_count=6\
 algorithm.filter_sample.hard_count=8\
+algorithm.filter_sample.very_hard_count=8\
```

```python
    Returns:
        kept_traj_idxs: the desirable responses to train on.
    """
    id2response_and_score, id2average_reward = compute_scores(data, metric, adaptive, metric_name)
```
Contributor


critical

The arguments to compute_scores are passed positionally, but the order is incorrect. adaptive is passed as metric_name and metric_name is passed as adaptive. This will likely cause a KeyError when trying to access data.batch[adaptive] inside compute_scores. You should use keyword arguments to avoid this kind of error.

Suggested change:

```diff
-    id2response_and_score, id2average_reward = compute_scores(data, metric, adaptive, metric_name)
+    id2response_and_score, id2average_reward = compute_scores(data, metric=metric, metric_name=metric_name, adaptive=adaptive)
```

```python
    else:
        for id in id2response_and_score.keys():
            id_score = id2response_and_score[id]
            for i in range(min(retain_count, len(id_score[i]))):
```
Contributor


critical

There is a bug in the loop condition. id_score is a list of tuples, so id_score[i] is a tuple and len(id_score[i]) will always be 2. The intention is likely to iterate up to retain_count or the number of available scores for that id, which is len(id_score).

Suggested change:

```diff
-            for i in range(min(retain_count, len(id_score[i]))):
+            for i in range(min(retain_count, len(id_score))):
```

```python
            id2response_and_score[index[i]].append((i, response_length[i]))
    elif metric == "token efficiency":
        for i in range(bsz):
            id2response_and_score[index[i]].append((i, -reward_value[i] / response_length[i]))
```
Contributor


high

Division by response_length[i] can lead to a division-by-zero error if a response is empty (i.e., response_length[i] is 0). It's safer to add a small epsilon to the denominator to prevent this.

Suggested change:

```diff
-            id2response_and_score[index[i]].append((i, -reward_value[i] / response_length[i]))
+            id2response_and_score[index[i]].append((i, -reward_value[i] / (response_length[i] + 1e-8)))
```

```bash
algorithm.filter_groups.enable=${enable_filter_groups} \
algorithm.filter_groups.max_num_gen_batches=${max_num_gen_batches} \
algorithm.filter_groups.metric=${filter_groups_metric} \
algorithm.filter_sample.enable=True\
```
Collaborator


Adding default parameters for the group filtering policy to the existing DAPO scripts is not recommended; how about creating a new file?

Author

@jianfeng-Liu jianfeng-Liu Sep 15, 2025


Okay, I have created two new files: run_dapo_qwen2.5_32b_w_GFPO.sh and run_dapo_qwen_3_4b_w_GFPO.sh.

```python
        for i in range(bsz):
            id2response_and_score[index[i]].append((i, -reward_value[i] / (response_length[i] + 10**(-8))))
    else:
        raise NotImplementedError
```
Collaborator


Suggest adding a short explanatory message to this error so that it is easier to understand.
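
For instance (illustrative wording only; the exact message is up to the author, and `metric` is assumed to be the unsupported value in scope), the raise could name the bad metric and the accepted options:

```python
# Illustrative wording only for the suggested error message.
raise NotImplementedError(
    f"filter_sample metric '{metric}' is not supported; "
    "expected 'response length' or 'token efficiency'."
)
```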

```diff
@@ -1 +0,0 @@
-0.5.0.dev
```
Collaborator


Why was this content removed?

@jianfeng-Liu jianfeng-Liu changed the title Implements Group Filtering Policy Optimization [RFC]Implements Group Filtering Policy Optimization Sep 15, 2025
```yaml
# Power used for weight scaling in "pow" method
weight_pow: 2.0

filter_sample:
```
Contributor


Configuration entries in the YAML are usually accompanied by explanatory comments to improve usability. For example, retain_count does not take effect when adaptive=True.
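
For example, an inline-commented version of the block above could look like the following. This is suggested wording only; the values mirror the example script, and the comments reflect my reading of the adaptive mode (difficulty-based per-group budgets), not authoritative documentation:

```yaml
filter_sample:
  enable: True                # turn GFPO group filtering on or off
  metric: "response length"   # or "token efficiency" (reward per generated token)
  retain_count: 8             # responses kept per prompt group; ignored when adaptive=True
  adaptive: True              # choose the per-group budget from the difficulty buckets below
  easy_count: 4
  medium_count: 6
  hard_count: 8
  very_hard_count: 8
```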

```python
        for i in range(bsz):
            id2response_and_score[index[i]].append((i, -reward_value[i] / (response_length[i] + 10**(-8))))
    else:
        raise NotImplementedError(f"metric {metric} not supported")
```
Contributor


The metric dispatch and the loop are redundant and can be merged into a unified processing approach, which also makes it easier to extend with future metrics:

```python
score_func = None
if metric == "response length":
    score_func = lambda i: response_length[i]
elif metric == "token efficiency":
    score_func = lambda i: -reward_value[i] / (response_length[i] + 1e-8)
else:
    raise NotImplementedError(f"metric {metric} not supported")
for i in range(bsz):
    id2response_and_score[index[i]].append((i, score_func(i)))
```

Contributor

@tardis-key tardis-key left a comment


LGTM on code implementation.
To prove the correctness of the implementation, I believe it is necessary to provide some experimental data and documentation.

@jianfeng-Liu
Author

> LGTM on code implementation. To prove the correctness of the implementation, I believe it is necessary to provide some experimental data and documentation.

@jianfeng-Liu jianfeng-Liu deleted the detached branch September 23, 2025 13:53
@jianfeng-Liu jianfeng-Liu restored the detached branch September 23, 2025 13:53
@jianfeng-Liu jianfeng-Liu reopened this Sep 23, 2025
@jianfeng-Liu
Author

> LGTM on code implementation. To prove the correctness of the implementation, I believe it is necessary to provide some experimental data and documentation.

Okay, I have provided some experimental results; more detailed documentation and experimental results can be found in the paper https://arxiv.org/pdf/2508.09726 and #3449.

@tardis-key
Contributor

please address comments from other reviewers

Contributor

@tardis-key tardis-key left a comment


LGTM

@jianfeng-Liu
Author

@FightingZhen @eric-haibin-lin @CLAassistant Hi, are there any other questions?

@jianfeng-Liu
Author

> please address comments from other reviewers

Okay, I have addressed all the comments raised by the other reviewers so far.
