Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

📅 2026-05-20

🤖 AI Summary
This work addresses advantage collapse in Group Relative Policy Optimization (GRPO), where homogeneous rewards within a group yield near-zero advantages, vanishing gradients, and stalled learning. To diagnose the phenomenon quantitatively, it introduces the Advantage Collapse Rate (ACR), a metric for the proportion of training batches with ineffective gradients, and proposes Adaptive Virtual Sample Policy Optimization (AVSPO), a lightweight extension of GRPO that adaptively injects virtual reward samples to sustain the learning signal without additional model rollouts. Evaluated in the RLVR setting on models from 0.5B to 14B parameters, AVSPO reduces advantage collapse by 58–63% relative to GRPO and improves mathematical reasoning accuracy by 4–6 percentage points, while maintaining generalization on the evaluated out-of-domain task.
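The collapse follows directly from the group-normalized advantage commonly used in GRPO. As a sketch, assuming the standard formulation with group size $G$, per-response rewards $r_i$, and a small stabilizer $\epsilon$ (these symbols are introduced here for illustration and are not taken from the summary):

```latex
\[
\hat{A}_i = \frac{r_i - \mu}{\sigma + \epsilon},
\qquad
\mu = \frac{1}{G}\sum_{j=1}^{G} r_j,
\qquad
\sigma = \sqrt{\frac{1}{G}\sum_{j=1}^{G} (r_j - \mu)^2}.
\]
% Homogeneous group: r_1 = r_2 = \dots = r_G = c
\[
\mu = c,\quad \sigma = 0
\;\;\Longrightarrow\;\;
\hat{A}_i = \frac{c - c}{0 + \epsilon} = 0
\quad \text{for all } i.
\]
```

With every advantage equal to zero, the policy-gradient term for that group carries no learning signal, whether the responses were all correct or all incorrect.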
📝 Abstract
Group Relative Policy Optimization (GRPO), a prominent algorithm within the Reinforcement Learning from Verifiable Rewards (RLVR) framework, has achieved strong results in improving the reasoning capabilities of large language models (LLMs). However, GRPO is prone to advantage collapse, a failure mode where homogeneous rewards within a group (e.g., all correct or all incorrect answers) yield near-zero advantages and vanishing gradients. To address this, we introduce the Advantage Collapse Rate (ACR), the first diagnostic metric quantifying the proportion of training batches with ineffective gradients. Across models from 0.5B to 14B parameters on mathematical reasoning benchmarks, we show that ACR strongly predicts training stagnation and final performance. We then propose Adaptive Virtual Sample Policy Optimization (AVSPO), a lightweight extension of GRPO that injects virtual reward samples, guided by real-time ACR monitoring, to enable learning from homogeneous groups without additional model rollouts. AVSPO reduces advantage collapse by 58-63% relative to GRPO and yields consistent accuracy gains of 4-6 percentage points across all model scales, while maintaining generalization on the evaluated out-of-domain task. Code and datasets are available at https://qingyonghu.github.io/AVSPO.
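The diagnostic idea can be illustrated with a minimal sketch. It assumes binary verifiable rewards, the standard group-normalized advantage, and a per-group notion of collapse (reward spread below a tolerance). The function names, the `tol` threshold, and the per-group granularity are hypothetical choices for illustration; the paper's exact ACR definition and the AVSPO injection rule may differ and are not reproduced here.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for the responses sampled for one prompt."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_collapsed(rewards, tol=1e-8):
    """A group is treated as collapsed when all its rewards are (near-)identical."""
    return max(rewards) - min(rewards) <= tol


def advantage_collapse_rate(reward_groups, tol=1e-8):
    """Fraction of groups that yield no usable advantage signal (assumed definition)."""
    flags = [is_collapsed(g, tol) for g in reward_groups]
    return sum(flags) / len(flags)


# Toy batch: one reward group per prompt, four sampled responses each.
groups = [
    [1, 1, 1, 1],  # all correct   -> collapsed
    [0, 0, 0, 0],  # all incorrect -> collapsed
    [1, 0, 0, 1],  # mixed         -> informative
    [0, 1, 0, 0],  # mixed         -> informative
]

acr = advantage_collapse_rate(groups)  # 2 of 4 groups collapsed -> 0.5
print(acr)
print(group_advantages(groups[0]))  # all zeros for a homogeneous group
```

Tracking this ratio per training step gives a simple signal for when a growing share of groups stops contributing gradients, which is the situation the abstract describes AVSPO as responding to.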
Problem

Research questions and friction points this paper is trying to address.

advantage collapse
Group Relative Policy Optimization
Reinforcement Learning from Verifiable Rewards
vanishing gradients
homogeneous rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Advantage Collapse
ACR
AVSPO
GRPO
Reinforcement Learning from Verifiable Rewards
Xixiang He
National University of Defense Technology, Changsha, Hunan, China
Qiyao Sun
Queen Mary University of London
AI Scientist
Ao Cheng
National University of Defense Technology, Changsha, Hunan, China
Xingming Li
National University of Defense Technology, Changsha, Hunan, China
Xuanyu Ji
National University of Defense Technology, Changsha, Hunan, China
Hailun Lu
Intelligent Game and Decision Lab, Beijing, China
Runke Huang
The Chinese University of Hong Kong, Shenzhen, Guangdong, China
Qingyong Hu
Ph.D. of Computer Science, University of Oxford
3D Vision, Photogrammetry, Point Cloud Processing, Autonomous Driving