Geometric-Mean Policy Optimization

📅 2025-07-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
GRPO improves large language model reasoning by optimizing the arithmetic mean of token-level rewards, but the arithmetic mean is sensitive to tokens with outlier importance-weighted rewards: extreme importance sampling ratios cause high variance and unstable policy updates. To address this, the authors propose Geometric-Mean Policy Optimization (GMPO), which incorporates the geometric mean of token-level rewards into the policy optimization objective. Because the geometric mean is less sensitive to extreme values than the arithmetic mean, this formulation keeps importance sampling ratios in a more stable range, improving training robustness. GMPO builds on the GRPO framework, replacing the arithmetic-mean surrogate with the geometric-mean objective while retaining importance sampling. Empirically, GMPO-7B outperforms GRPO by an average of 4.1% on mathematical reasoning benchmarks and 1.4% on a multimodal reasoning benchmark.

📝 Abstract
Recent advancements, such as Group Relative Policy Optimization (GRPO), have enhanced the reasoning capabilities of large language models by optimizing the arithmetic mean of token-level rewards. However, GRPO suffers from unstable policy updates when processing tokens with outlier importance-weighted rewards, which manifests as extreme importance sampling ratios during training, i.e., the ratio between the sampling probabilities assigned to a token by the current and old policies. In this work, we propose Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratios. In addition, we provide comprehensive theoretical and experimental analysis to justify the design and stability benefits of GMPO. Beyond improved stability, GMPO-7B outperforms GRPO by an average of 4.1% on multiple mathematical benchmarks and 1.4% on a multimodal reasoning benchmark, with evaluations including AIME24, AMC, MATH500, OlympiadBench, Minerva, and Geometry3K. Code is available at https://github.com/callsys/GMPO.
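To illustrate the core idea, here is a minimal numerical sketch (not the paper's implementation) contrasting an arithmetic-mean surrogate, as in GRPO, with a geometric-mean surrogate, as in GMPO. The token log-ratios, the outlier value, and the scalar advantage are illustrative assumptions; the geometric mean is computed in log space, which is valid here since the advantage is positive.

```python
import numpy as np

# Hypothetical per-token log importance ratios log(pi_theta / pi_old)
# for one sampled response; token 3 is an outlier (illustrative values).
log_ratios = np.array([0.05, -0.02, 0.10, 2.5, 0.01])
advantage = 1.0  # single sequence-level advantage, shared across tokens

ratios = np.exp(log_ratios)

# GRPO-style surrogate: arithmetic mean of per-token ratio * advantage.
grpo_obj = np.mean(ratios * advantage)

# GMPO-style surrogate: geometric mean of per-token ratio * advantage,
# computed stably as exp(mean of log-ratios).
gmpo_obj = np.exp(np.mean(log_ratios)) * advantage

print(grpo_obj)  # inflated by the outlier ratio exp(2.5) ~ 12.2
print(gmpo_obj)  # far less affected by the single outlier token
```

The single outlier token pulls the arithmetic mean to roughly 3.3, while the geometric mean stays near 1.7, which mirrors the paper's argument that the geometric mean suppresses the variance induced by extreme importance sampling ratios.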
Problem

Research questions and friction points this paper is trying to address.

Instability of policy updates when token rewards contain outlier importance sampling ratios
High variance induced by the arithmetic-mean objective used in GRPO
Improving performance on mathematical and multimodal reasoning benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Geometric-Mean Policy Optimization (GMPO), a stabilized variant of GRPO
Maximizes the geometric mean of token-level rewards, suppressing outlier importance sampling ratios
Outperforms GRPO by an average of 4.1% on mathematical benchmarks and 1.4% on a multimodal benchmark