🤖 AI Summary
Reward noise in RLHF/RLVR (e.g., human labeling errors or verifier inaccuracies) severely degrades group-based policy optimization, yet existing methods lack systematic modeling and correction of such noise.
Method: We propose Dr.GRPO (Done Right GRPO), a noise-robust Group Relative Policy Optimization framework and the first to integrate label-noise correction into RLHF. Theoretical analysis establishes that group-based policy optimization is intrinsically noise-robust. Under a Bernoulli noise model, Dr.GRPO jointly estimates reward flip probabilities, computes intra-group relative advantages, and applies bias correction to yield unbiased gradient estimates.
Contribution/Results: Evaluated on mathematical reasoning and code generation tasks, Dr.GRPO achieves substantial improvements: up to +6.7 percentage points in math accuracy and +1.5 points on code tasks. These results empirically validate its effectiveness and generalizability under realistic reward noise conditions.
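The two core steps named above (Bernoulli bias correction and intra-group relative advantages) can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes binary {0, 1} rewards flipped symmetrically with a known probability `flip_prob`, and uses the classic label-noise inversion from supervised learning together with GRPO-style within-group standardization. The function names are ours.

```python
import numpy as np

def debias_rewards(noisy_rewards, flip_prob):
    """Invert symmetric Bernoulli label noise on binary {0, 1} rewards.

    If the observed reward equals the true reward with probability
    1 - flip_prob, then E[noisy] = true * (1 - 2 * flip_prob) + flip_prob,
    so the estimator below is unbiased for the true reward.
    """
    assert flip_prob < 0.5, "flip rate must be below 1/2 to be invertible"
    return (np.asarray(noisy_rewards, dtype=float) - flip_prob) / (1.0 - 2.0 * flip_prob)

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of 4 sampled completions with noisy binary rewards.
noisy = np.array([1.0, 0.0, 1.0, 1.0])
clean_est = debias_rewards(noisy, flip_prob=0.1)  # debiased reward estimates
adv = group_relative_advantages(clean_est)        # zero-mean advantages
```

Because standardization subtracts the group mean, a constant additive bias in the rewards cancels within each group, which is consistent with the claim that group-based methods already mitigate some individual-level noise; the explicit debiasing step additionally corrects the multiplicative shrinkage that noise introduces.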
📝 Abstract
Reinforcement learning from human feedback (RLHF) or verifiable rewards (RLVR), the standard paradigm for aligning LLMs and for building recent SOTA reasoning models, is highly sensitive to noise from inconsistent or erroneous rewards. Yet the interaction between such noise and widely used group-based policy optimization methods remains underexplored. We introduce Dr.GRPO (Done Right GRPO), a noise-robust Group Relative Policy Optimization (GRPO) framework that explicitly models reward corruption as Bernoulli noise. Our method estimates reward flip probabilities and then applies noise correction to debias the learning signal, yielding provably unbiased gradient estimates. Theoretical analysis shows that group-based methods inherently mitigate individual-level noise, and our correction strategy amplifies this robustness. Empirically, applying our noise correction to standard reward model usage yields consistent improvements across math and code tasks, with gains of up to 6.7 percentage points in accuracy on math tasks and 1.5 points on code tasks under realistic reward model conditions. This work bridges label-noise correction from supervised learning with modern RLHF, offering both theoretical insights and a practical algorithm for noisy real-world deployment.