Mitigating Think-Answer Mismatch in LLM Reasoning Through Noise-Aware Advantage Reweighting

πŸ“… 2025-08-07
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Group-Relative Policy Optimization (GRPO) for large reasoning models suffers from Think-Answer Mismatchβ€”i.e., response-group imbalance introduces noisy reward signals that distort advantage estimation and dilute critical learning signals. Method: We formally characterize this issue and propose S-GRPO, a noise-aware variant of GRPO featuring a dynamic advantage reweighting mechanism that suppresses the influence of high-noise responses within imbalanced groups, thereby stabilizing signal propagation. Contribution/Results: Evaluated on mathematical reasoning benchmarks, S-GRPO achieves average improvements exceeding 2.3% across Qwen and Llama-based models. Crucially, it maintains effective convergence even under 20% strong reward noise, significantly enhancing both robustness and performance in complex reasoning training. To our knowledge, this is the first work to formalize Think-Answer Mismatch in GRPO and introduce a principled, noise-resilient optimization framework for reasoning-oriented reinforcement learning.

πŸ“ Abstract
Group-Relative Policy Optimization (GRPO) is a key technique for training large reasoning models, yet it suffers from a critical vulnerability: the *Think-Answer Mismatch*, where noisy reward signals corrupt the learning process. This problem is most severe in unbalanced response groups, paradoxically degrading the signal precisely when it should be most informative. To address this challenge, we propose Stable Group-Relative Policy Optimization (S-GRPO), a principled enhancement that derives optimal, noise-aware advantage weights to stabilize training. Our comprehensive experiments on mathematical reasoning benchmarks demonstrate S-GRPO's effectiveness and robustness. Across various models, S-GRPO significantly outperforms Dr. GRPO, achieving performance gains of +2.5% on Qwen-Math-7B-Base, +2.2% on Llama-3.2-3B-Base, and +2.4% on Qwen-Math-1.5B-Instruct. Most critically, while standard GRPO fails to learn under 20% synthetic reward noise, S-GRPO maintains stable learning progress. These results highlight S-GRPO's potential for more robust and effective training of large-scale reasoning models. Code and data are available at: https://github.com/shenpeijun0212/S-GRPO
Problem

Research questions and friction points this paper is trying to address.

Mitigates Think-Answer Mismatch in LLM reasoning
Addresses noisy reward signals in unbalanced response groups
Enhances training stability with noise-aware advantage weights
Innovation

Methods, ideas, or system contributions that make the work stand out.

Noise-aware advantage reweighting for stable training
Enhanced Group-Relative Policy Optimization (S-GRPO)
Optimal weights to mitigate reward signal noise
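The reweighting idea listed above can be sketched in a few lines. This is a minimal illustrative sketch under stated assumptions, not the paper's actual formulation: the function name `s_grpo_advantages`, the imbalance proxy `min(p, 1 - p)` for binary (0/1) rewards, and the exponent `alpha` are all hypothetical choices standing in for the paper's derived optimal weights.

```python
def s_grpo_advantages(rewards, alpha=1.0):
    """Group-relative advantages with a hypothetical noise-aware reweighting.

    rewards: list of scalar (0/1) rewards for one response group.
    alpha: assumed exponent controlling how strongly lopsided
           (noise-prone) groups are downweighted.
    """
    n = len(rewards)
    baseline = sum(rewards) / n
    # Standard GRPO-style centering: advantage = reward minus group mean.
    advantages = [r - baseline for r in rewards]
    # Imbalance proxy for binary rewards: small when the group is lopsided
    # (e.g. one correct answer among many wrong ones).
    p = baseline
    imbalance = min(p, 1.0 - p)
    # Suppress signals from highly imbalanced groups, where the summary
    # above says noisy rewards distort advantage estimation the most.
    weight = imbalance ** alpha
    return [weight * a for a in advantages]
```

For example, a balanced group `[1, 1, 0, 0]` keeps more of its advantage magnitude than a lopsided group `[1, 0, 0, 0]`, which is the qualitative behavior the summary describes; the paper derives the actual weights rather than using this ad-hoc proxy.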
Si Shen
Hong Kong University of Science and Technology
Data Mining · Web Search
Peijun Shen
Department of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
Wenhua Zhao
Department of Information Management, Nanjing Agricultural University, Nanjing, 210095, China
Danhao Zhu
Department of Criminal Science and Technology, Jiangsu Police Institute, Nanjing, 210031, China