🤖 AI Summary
Reward noise in RLHF/RLVR (e.g., human labeling errors or verifier inaccuracies) severely degrades group-based policy optimization, yet existing methods lack systematic modeling and correction of such noise.
Method: We propose Dr.GRPO (Done Right GRPO), a noise-robust Group Relative Policy Optimization framework and the first to integrate label-noise correction into RLHF. Theoretical analysis establishes that group-based policy optimization is intrinsically noise-robust. Under a Bernoulli noise model, Dr.GRPO jointly estimates reward flip probabilities, computes intra-group relative advantages, and applies bias correction to yield unbiased gradient estimates.
Contribution/Results: Evaluated on mathematical reasoning and code generation tasks, Dr.GRPO achieves substantial improvements: up to +6.7 percentage points in math accuracy and +1.5 points on code tasks. These results empirically validate its effectiveness and generalizability under realistic reward noise conditions.
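The two core steps named above (Bernoulli bias correction and intra-group relative advantages) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes binary {0, 1} rewards flipped symmetrically with a known probability `flip_prob`, and uses the classic label-noise inversion from supervised learning together with GRPO-style within-group standardization. The function names are ours.

```python
import numpy as np

def debias_rewards(noisy_rewards, flip_prob):
    """Invert symmetric Bernoulli label noise on binary {0, 1} rewards.

    If the observed reward equals the true reward with probability
    1 - flip_prob, then E[noisy] = true * (1 - 2 * flip_prob) + flip_prob,
    so the estimator below is unbiased for the true reward.
    """
    assert flip_prob < 0.5, "flip rate must be below 1/2 to be invertible"
    return (np.asarray(noisy_rewards, dtype=float) - flip_prob) / (1.0 - 2.0 * flip_prob)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of 4 sampled completions with noisy binary rewards.
noisy = np.array([1.0, 0.0, 1.0, 1.0])
clean_est = debias_rewards(noisy, flip_prob=0.1)  # debiased reward estimates
adv = group_relative_advantages(clean_est)        # zero-mean advantages
```

Because standardization subtracts the group mean, a constant additive bias in the rewards cancels within each group, which is consistent with the claim that group-based methods already mitigate some individual-level noise; the explicit debiasing step additionally corrects the multiplicative shrinkage that noise introduces.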
📝 Abstract
Reinforcement learning from human feedback (RLHF) or verifiable rewards (RLVR), the standard paradigm for aligning LLMs and for building recent SOTA reasoning models, is highly sensitive to noise from inconsistent or erroneous rewards. Yet the interaction between such noise and widely used group-based policy optimization methods remains underexplored. We introduce Dr.GRPO (Done Right GRPO), a noise-robust Group Relative Policy Optimization (GRPO) framework that explicitly models reward corruption as Bernoulli noise. Our method estimates reward flip probabilities and then applies noise correction to debias the learning signal, yielding provably unbiased gradient estimates. Theoretical analysis shows that group-based methods inherently mitigate individual-level noise, and our correction strategy amplifies this robustness. Empirically, applying our noise correction to standard reward model usage yields consistent improvements across math and code tasks, with gains of up to 6.7 percentage points in accuracy on math tasks and 1.5 points on code tasks under realistic reward model conditions. This work bridges label-noise correction from supervised learning with modern RLHF, offering both theoretical insights and a practical algorithm for noisy real-world deployment.