Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

📅 2026-04-23

🤖 AI Summary
This work addresses the vulnerability of test-time reinforcement learning for mathematical reasoning to pseudo-label noise. It shows that responses with moderate consistency form an ambiguous region that is the dominant source of reward noise, and that group-wise advantage estimation amplifies the resulting bias into spurious optimization signals. To mitigate this, the authors propose DDRL, a unified framework with three components: frequency-based sampling that filters out ambiguous samples, a debiased estimator with fixed advantages that suppresses estimation bias, and a consensus-driven off-policy refinement stage for further denoising. Evaluated on three large language models and multiple mathematical reasoning benchmarks, DDRL consistently outperforms existing test-time reinforcement learning baselines.
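The amplification effect can be illustrated with a small numerical sketch. This is not the authors' code: it assumes the common setup in which the pseudo-label is the majority-voted answer, the reward is 1 for agreeing with it and 0 otherwise, and the group-relative advantage is the reward standardized within the group. The function names are hypothetical.

```python
from collections import Counter
import statistics


def majority_vote_rewards(answers):
    """Return (pseudo_label, consistency, rewards) for one prompt.

    Assumption: the pseudo-label is the most frequent answer, and a
    response earns reward 1.0 if it matches that answer, else 0.0.
    """
    pseudo_label, votes = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, votes / len(answers), rewards


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within the group (mean 0, unit scale)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Moderate consistency: only 3 of 8 responses agree with the pseudo-label.
_, cons_mid, r_mid = majority_vote_rewards(["12"] * 3 + ["7"] * 2 + ["5", "9", "11"])
adv_mid = group_relative_advantages(r_mid)

# High consistency: 7 of 8 responses agree with the pseudo-label.
_, cons_high, r_high = majority_vote_rewards(["12"] * 7 + ["7"])
adv_high = group_relative_advantages(r_high)
```

Under these assumptions, responses that agree with the pseudo-label receive an advantage of roughly 1.29 in the moderate-consistency group but only about 0.38 in the high-consistency group. The positive push is therefore largest exactly where the majority answer is least trustworthy, which is one way to read the amplification described above.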

📝 Abstract
Test-time reinforcement learning (TTRL) adapts models at inference time via pseudo-labeling, which leaves it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can even be amplified through group-relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test-time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency-based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization. Finally, DDRL incorporates a consensus-based off-policy refinement stage, which leverages the rejection-sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released at https://github.com/yuyongcan/DDRL.
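The first two stages described in the abstract might look roughly like the sketch below. It is an illustration under stated assumptions, not the released DDRL implementation: the thresholds `lo` and `hi`, the rule that high-frequency answers count as positives and low-frequency answers as negatives, the truncation used for balancing, and the advantage values of +1 and -1 are all hypothetical choices made for the example.

```python
from collections import Counter


def select_and_score(answers, lo=0.2, hi=0.6, pos_adv=1.0, neg_adv=-1.0):
    """Filter one prompt's sampled answers and attach fixed advantages.

    Hypothetical rule: a response is a positive if its answer's relative
    frequency is at least `hi`, a negative if it is at most `lo`, and is
    dropped as ambiguous otherwise. Positives and negatives are then
    truncated to the same count so the kept set stays balanced.
    Returns a list of (response_index, advantage) pairs.
    """
    n = len(answers)
    freq = {a: c / n for a, c in Counter(answers).items()}

    positives = [i for i, a in enumerate(answers) if freq[a] >= hi]
    negatives = [i for i, a in enumerate(answers) if freq[a] <= lo]

    # Balance the two sides by keeping the same number from each.
    k = min(len(positives), len(negatives))

    # Fixed advantages replace the group-standardized ones, so their
    # magnitude no longer depends on the reward statistics of the group.
    return [(i, pos_adv) for i in positives[:k]] + [(i, neg_adv) for i in negatives[:k]]


# A confident prompt: "12" dominates, "7" falls in the ambiguous band, "5" is rare.
confident = select_and_score(["12"] * 5 + ["7"] * 2 + ["5"])

# An ambiguous prompt: no answer is frequent or rare enough, so nothing is kept.
ambiguous = select_and_score(["12"] * 3 + ["7"] * 3 + ["5"] * 2)
```

With these example thresholds, the confident prompt keeps one positive and one negative response, while the ambiguous prompt contributes no training signal at all. The consensus-based off-policy refinement stage is not sketched here because the abstract does not describe it in enough detail.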
Problem

Research questions and friction points this paper is trying to address.

test-time reinforcement learning
spurious signal
label noise
mathematical reasoning
reward noise
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time reinforcement learning
spurious signal mitigation
debiased advantage estimation
frequency-based sampling
consensus-based refinement
Yongcan Yu
Master Student, CASIA
Trustworthy AI, Safety in AI
Lingxiao He
NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences; Meituan
Jian Liang
Kuaishou Inc.
transfer learning, graph learning
Kuangpu Guo
NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences; University of Science and Technology of China
Meng Wang
Meituan
Qianlong Xie
Meituan
Xingxing Wang
Meituan
Ran He
NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences