🤖 AI Summary
Group-Relative Policy Optimization (GRPO) for large reasoning models suffers from Think-Answer Mismatch, i.e., response-group imbalance introduces noisy reward signals that distort advantage estimation and dilute critical learning signals.
Method: We formally characterize this issue and propose S-GRPO, a noise-aware variant of GRPO featuring a dynamic advantage reweighting mechanism that suppresses the influence of high-noise responses within imbalanced groups, thereby stabilizing signal propagation.
Contribution/Results: Evaluated on mathematical reasoning benchmarks, S-GRPO achieves average improvements exceeding 2.3% across Qwen and Llama-based models. Crucially, it maintains effective convergence even under 20% strong reward noise, significantly enhancing both robustness and performance in complex reasoning training. To our knowledge, this is the first work to formalize Think-Answer Mismatch in GRPO and introduce a principled, noise-resilient optimization framework for reasoning-oriented reinforcement learning.
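The core mechanism can be illustrated with a small sketch. Standard GRPO computes each response's advantage relative to its group's mean reward; S-GRPO additionally downweights advantages in imbalanced groups, where a single noisy reward most distorts the group baseline. The reweighting formula below (a balance score shrunk by an assumed noise rate) is a hypothetical illustration of the idea, not the paper's derived optimal weights:

```python
import numpy as np

def grpo_advantages(rewards):
    """Standard GRPO: advantage = (reward - group mean) / group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def noise_aware_advantages(rewards, noise_rate=0.1):
    """Illustrative noise-aware reweighting (hypothetical form):
    shrink advantages more when the group of binary rewards is
    imbalanced, since a flipped reward there shifts the baseline most."""
    r = np.asarray(rewards, dtype=float)
    adv = grpo_advantages(r)
    p = r.mean()                      # fraction of reward-1 responses
    balance = 4.0 * p * (1.0 - p)     # 1 when balanced, -> 0 when imbalanced
    weight = balance / (balance + noise_rate)
    return weight * adv
```

For a balanced group like `[1, 0, 1, 0]` the weight stays near 1, while a skewed group like `[1, 0, 0, 0]` is shrunk harder, muting the noisy signal the paper identifies.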
📝 Abstract
Group-Relative Policy Optimization (GRPO) is a key technique for training large reasoning models, yet it suffers from a critical vulnerability: the *Think-Answer Mismatch*, where noisy reward signals corrupt the learning process. This problem is most severe in unbalanced response groups, paradoxically degrading the signal precisely when it should be most informative. To address this challenge, we propose Stable Group-Relative Policy Optimization (S-GRPO), a principled enhancement that derives optimal, noise-aware advantage weights to stabilize training. Our comprehensive experiments on mathematical reasoning benchmarks demonstrate S-GRPO's effectiveness and robustness. Across models, S-GRPO significantly outperforms Dr. GRPO, achieving performance gains of +2.5% on Qwen-Math-7B-Base, +2.2% on Llama-3.2-3B-Base, and +2.4% on Qwen-Math-1.5B-Instruct. Most critically, while standard GRPO fails to learn under 20% synthetic reward noise, S-GRPO maintains stable learning progress. These results highlight S-GRPO's potential for more robust and effective training of large-scale reasoning models. Code and data are available at: https://github.com/shenpeijun0212/S-GRPO