MC-GRPO: Median-Centered Group Relative Policy Optimization for Small-Rollout Reinforcement Learning

📅 2026-01-30
🤖 AI Summary
This work addresses the instability and performance degradation in small-scale rollout reinforcement learning caused by noise-sensitive mean-based baselines, which can lead to sign flips in advantage estimates. To mitigate this issue, the authors propose MC-GRPO, the first method to incorporate a median baseline into Group Relative Policy Optimization (GRPO). By centering advantage estimates around the median, MC-GRPO effectively suppresses the influence of outlier rewards. Additionally, a gradient exclusion mechanism is introduced to maintain computational overhead comparable to the original GRPO. Empirical results demonstrate that MC-GRPO significantly enhances training stability and final accuracy in low-sample regimes, reducing the performance gap between G=2 and G=8 rollouts to within 1% across diverse models and scales.
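The sign-flip problem described above is easy to see with a toy group of rewards. The sketch below (not the authors' code; the reward values are illustrative) shows how one outlier reward drags a mean baseline past the other rollouts' rewards, so correct completions receive negative advantages, while a median baseline leaves their advantage signs intact:

```python
# Toy illustration of advantage sign flips under a mean baseline.
# A group of G=4 rollouts: three correct completions (reward 1) and one
# outlier that received a large spurious bonus (reward 8).
import statistics

rewards = [1.0, 1.0, 1.0, 8.0]

# Mean baseline (standard GRPO-style centering): the outlier pulls the
# baseline to 2.75, so every correct rollout gets a *negative* advantage
# and its update direction is reversed.
mean_adv = [r - statistics.mean(rewards) for r in rewards]

# Median baseline (MC-GRPO's idea): the baseline stays at 1.0, so the
# correct rollouts keep a non-negative advantage and only the outlier
# stands out.
median_adv = [r - statistics.median(rewards) for r in rewards]

print(mean_adv)    # correct rollouts: -1.75 each (sign flipped)
print(median_adv)  # correct rollouts: 0.0 each (no flip)
```

The same comparison underlies the paper's claim that the median is far less sensitive to outlier rewards than the mean in small groups.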

📝 Abstract
Group-relative policy optimization methods train language models by generating multiple rollouts per prompt and normalizing rewards with a shared mean reward baseline. In resource-constrained settings where the rollout budget is small, accuracy often degrades. We find that noise in the shared baseline induces advantage sign flips, where some rollouts receive an incorrect advantage sign and the update direction is reversed. To address this, we propose Median-Centered Group Relative Policy Optimization (MC-GRPO), a simple and effective solution for small-rollout training. Our main idea is to replace the mean baseline with a median baseline: the median is far less sensitive to outlier rewards than the mean, mitigating the sign flips under small rollout size (G). We generate one additional rollout for median reference (G+1) and compute advantages using the group median. With an odd-sized group, exactly one completion is the median and receives zero advantage; we exclude this pivot rollout from backpropagation so the number of gradient-contributing samples per prompt remains G, preserving the core update cost of standard G-rollout training. Across various GRPO-family methods and a wide range of models and scales, this median-centered training consistently improves stability and final accuracy in the low-rollout regime, reducing the gap between G=2 and G=8 to within 1%. Code is available at https://github.com/lotusroot-kim/MC-GRPO
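The abstract's recipe (sample G+1 rollouts so the group size is odd, center advantages on the group median, and drop the zero-advantage pivot from backpropagation) can be sketched as follows. This is a minimal stand-in written from the abstract's description, not the released implementation; the function name and the boolean gradient mask are assumptions:

```python
# Hedged sketch of the MC-GRPO advantage step for one prompt.
import statistics

def mc_grpo_advantages(rewards):
    """Compute median-centered advantages for an odd-sized group.

    rewards: list of G+1 scalar rollout rewards (odd length, so the
    median is always an actual group element).
    Returns (advantages, grad_mask), where the pivot rollout (the one
    equal to the median; the first such index if rewards tie) gets zero
    advantage and is masked out, so exactly G rollouts contribute
    gradients -- preserving the update cost of G-rollout training.
    """
    assert len(rewards) % 2 == 1, "group size G+1 must be odd"
    med = statistics.median(rewards)      # middle element of the group
    pivot = rewards.index(med)            # the zero-advantage rollout
    advantages = [r - med for r in rewards]
    grad_mask = [i != pivot for i in range(len(rewards))]
    return advantages, grad_mask

# Example: G=2 training samples G+1=3 rollouts per prompt.
advs, mask = mc_grpo_advantages([0.2, 1.0, 0.4])
```

In a real training loop the mask would zero the pivot's loss term (e.g. by detaching or excluding its log-probabilities), so backpropagation only touches the G non-pivot rollouts.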
Problem

Research questions and friction points this paper is trying to address.

small-rollout reinforcement learning
group-relative policy optimization
advantage sign flips
reward baseline noise
policy optimization stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Median-Centered
Small-Rollout RL
Advantage Sign Flip
Group Relative Policy Optimization
Robust Baseline