Mitigating Think-Answer Mismatch in LLM Reasoning Through Noise-Aware Advantage Reweighting

πŸ“… 2025-08-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Group-Relative Policy Optimization (GRPO) for large reasoning models suffers from Think-Answer Mismatchβ€”i.e., response-group imbalance introduces noisy reward signals that distort advantage estimation and dilute critical learning signals. Method: We formally characterize this issue and propose S-GRPO, a noise-aware variant of GRPO featuring a dynamic advantage reweighting mechanism that suppresses the influence of high-noise responses within imbalanced groups, thereby stabilizing signal propagation. Contribution/Results: Evaluated on mathematical reasoning benchmarks, S-GRPO achieves average improvements exceeding 2.3% across Qwen and Llama-based models. Crucially, it maintains effective convergence even under 20% strong reward noise, significantly enhancing both robustness and performance in complex reasoning training. To our knowledge, this is the first work to formalize Think-Answer Mismatch in GRPO and introduce a principled, noise-resilient optimization framework for reasoning-oriented reinforcement learning.

πŸ“ Abstract
Group-Relative Policy Optimization (GRPO) is a key technique for training large reasoning models, yet it suffers from a critical vulnerability: the *Think-Answer Mismatch*, where noisy reward signals corrupt the learning process. This problem is most severe in unbalanced response groups, paradoxically degrading the signal precisely when it should be most informative. To address this challenge, we propose Stable Group-Relative Policy Optimization (S-GRPO), a principled enhancement that derives optimal, noise-aware advantage weights to stabilize training. Our comprehensive experiments on mathematical reasoning benchmarks demonstrate S-GRPO's effectiveness and robustness. Across various models, S-GRPO significantly outperforms Dr. GRPO, achieving performance gains of +2.5% on Qwen-Math-7B-Base, +2.2% on Llama-3.2-3B-Base, and +2.4% on Qwen-Math-1.5B-Instruct. Most critically, while standard GRPO fails to learn under 20% synthetic reward noise, S-GRPO maintains stable learning progress. These results highlight S-GRPO's potential for more robust and effective training of large-scale reasoning models. Code and data are available at: https://github.com/shenpeijun0212/S-GRPO
Problem

Research questions and friction points this paper is trying to address.

Mitigates Think-Answer Mismatch in LLM reasoning
Addresses noisy reward signals in unbalanced response groups
Enhances training stability with noise-aware advantage weights
Innovation

Methods, ideas, or system contributions that make the work stand out.

Noise-aware advantage reweighting for stable training
Enhanced Group-Relative Policy Optimization (S-GRPO)
Optimal weights to mitigate reward signal noise
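The reweighting idea listed above can be sketched in a few lines. This is a minimal illustrative sketch under stated assumptions, not the paper's actual formulation: the function name `s_grpo_advantages`, the imbalance proxy `min(p, 1 - p)` for binary (0/1) rewards, and the exponent `alpha` are all hypothetical choices standing in for the paper's derived optimal weights.

```python
def s_grpo_advantages(rewards, alpha=1.0):
    """Group-relative advantages with a hypothetical noise-aware reweighting.

    rewards: list of scalar (0/1) rewards for one response group.
    alpha: assumed exponent controlling how strongly lopsided
           (noise-prone) groups are downweighted.
    """
    n = len(rewards)
    baseline = sum(rewards) / n
    # Standard GRPO-style centering: advantage = reward minus group mean.
    advantages = [r - baseline for r in rewards]
    # Imbalance proxy for binary rewards: small when the group is lopsided
    # (e.g. one correct answer among many wrong ones).
    p = baseline
    imbalance = min(p, 1.0 - p)
    # Suppress signals from highly imbalanced groups, where the summary
    # above says noisy rewards distort advantage estimation the most.
    weight = imbalance ** alpha
    return [weight * a for a in advantages]
```

For example, a balanced group `[1, 1, 0, 0]` keeps more of its advantage magnitude than a lopsided group `[1, 0, 0, 0]`, which is the qualitative behavior the summary describes; the paper derives the actual weights rather than using this ad-hoc proxy.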
Si Shen
Hong Kong University of Science and Technology
Data Mining · Web Search
Peijun Shen
Department of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
Wenhua Zhao
Department of Information Management, Nanjing Agricultural University, Nanjing, 210095, China
Danhao Zhu
Department of Criminal Science and Technology, Jiangsu Police Institute, Nanjing, 210031, China