This is the official implementation of the paper Geometric-Mean Policy Optimization. This repository contains PyTorch training code, evaluation code, and visualization methods.
- Geometric-Mean Policy Optimization
Recent advancements, such as Group Relative Policy Optimization (GRPO), have enhanced the reasoning capabilities of large language models by optimizing the arithmetic mean of token-level rewards. However, GRPO suffers from unstable policy updates when processing tokens with outlier importance-weighted rewards. This instability manifests as extreme importance sampling ratios during training, where the importance sampling ratio is the ratio between the sampling probabilities assigned to a token by the current and old policies. In this work, we propose Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratios. In addition, we provide comprehensive theoretical and experimental analyses to justify the design and stability benefits of GMPO. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks and 1.4% on a multimodal reasoning benchmark, including AIME24, AMC, MATH500, OlympiadBench, Minerva, and Geometry3K.
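To make the outlier argument concrete, here is a minimal sketch comparing an arithmetic mean and a geometric mean over per-token importance sampling ratios. It is an illustration only, not the training objective implemented in this repository: it assumes a single sequence-level advantage shared by all tokens, positive ratios, and no clipping, and the function names `arithmetic_mean_objective` and `geometric_mean_objective` are hypothetical.

```python
import math


def arithmetic_mean_objective(ratios, advantage):
    """Arithmetic mean of importance-weighted token rewards (GRPO-style, simplified)."""
    return advantage * sum(ratios) / len(ratios)


def geometric_mean_objective(ratios, advantage):
    """Geometric mean of importance-weighted token rewards (GMPO-style, simplified).

    Computed in the log domain for numerical stability; assumes every ratio > 0.
    """
    mean_log_ratio = sum(math.log(r) for r in ratios) / len(ratios)
    return advantage * math.exp(mean_log_ratio)


# Hypothetical example: three on-policy tokens and one outlier token.
ratios = [1.0, 1.0, 1.0, 10.0]
advantage = 1.0

arith = arithmetic_mean_objective(ratios, advantage)  # (1 + 1 + 1 + 10) / 4 = 3.25
geo = geometric_mean_objective(ratios, advantage)     # 10 ** (1 / 4), roughly 1.78

print(f"arithmetic mean: {arith:.3f}")
print(f"geometric mean:  {geo:.3f}")
```

In this toy case the single outlier ratio pulls the arithmetic mean to 3.25, while the geometric mean stays near 1.78, which is the kind of reduced outlier sensitivity the paragraph above describes.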
Using GMPO in verl.
Using GMPO in this repo.
conda create -n gmpo python==3.10
conda activate gmpo
pip install vllm==0.8.4 && pip install oat-llm==0.1.3.post1
cd understand_r1_zero_main
pip install -e .
bash scripts/qwen2.5-math-7b-gmpo.sh
If you have any questions about our work or this repository, please don't hesitate to contact us by email or open an issue under this project.
- Part of the code is borrowed from understand-r1-zero. We sincerely thank its authors for their contributions to the community.
@article{zhao2025geometric,
title={Geometric-mean policy optimization},
author={Zhao, Yuzhong and Liu, Yue and Liu, Junpeng and Chen, Jingye and Wu, Xun and Hao, Yaru and Lv, Tengchao and Huang, Shaohan and Cui, Lei and Ye, Qixiang and others},
journal={arXiv preprint arXiv:2507.20673},
year={2025}
}