Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

📅 2026-04-23

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the vulnerability of test-time reinforcement learning (TTRL) for mathematical reasoning to pseudo-label noise. The authors observe that responses with medium consistency form an ambiguity region that is the primary source of reward noise, and that group-relative advantage estimation can amplify the resulting spurious optimization signals. To mitigate this, they propose DDRL (Debiased and Denoised test-time Reinforcement Learning), a unified framework with three components: frequency-based sampling that excludes ambiguous samples while keeping positive and negative examples balanced, a debiased advantage estimator that uses fixed advantages, and a consensus-based off-policy refinement stage built on the rejection-sampled dataset. Evaluated on three large language models across multiple mathematical reasoning benchmarks, DDRL consistently outperforms existing TTRL baselines.
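
To make the amplification claim concrete, here is a short worked calculation. It is a sketch under assumptions, not the paper's own derivation: it assumes binary rewards and a GRPO-style advantage that standardizes each reward by the group mean and standard deviation. The symbols G, p, A^+ and A^- are introduced here for illustration only.

```latex
% Assumption: G sampled responses with rewards r_i \in \{0, 1\};
% a fraction p of them match the pseudo-label and receive r_i = 1.
\mu = p, \qquad \sigma = \sqrt{p(1-p)}

% Standardized advantage A_i = (r_i - \mu) / \sigma for each reward value:
A^{+} = \frac{1 - p}{\sqrt{p(1-p)}} = \sqrt{\frac{1-p}{p}}, \qquad
A^{-} = \frac{-p}{\sqrt{p(1-p)}} = -\sqrt{\frac{p}{1-p}}

% Example with p = 0.25:
A^{+} = \sqrt{3} \approx 1.73, \qquad A^{-} = -\sqrt{1/3} \approx -0.58
```

Under these assumptions, the weaker the agreement behind a pseudo-label, the larger the positive advantage given to responses that match it. If that pseudo-label is wrong, the incorrect responses receive the strongest push, which is one way a noisy reward can be magnified rather than damped.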

📝 Abstract

Test-time reinforcement learning (TTRL) adapts models at inference time via pseudo-labeling, leaving it vulnerable to spurious optimization signals from label noise. Through an empirical study, we observe that responses with medium consistency form an ambiguity region and constitute the primary source of reward noise. Crucially, we find that such spurious signals can even be amplified through group-relative advantage estimation. Motivated by these findings, we propose a unified framework, Debiased and Denoised test-time Reinforcement Learning (DDRL), to mitigate spurious signals. Concretely, DDRL first applies a frequency-based sampling strategy to exclude ambiguous samples while maintaining a balanced set of positive and negative examples. It then adopts a debiased advantage estimation with fixed advantages, removing the bias introduced by group-relative policy optimization. Finally, DDRL incorporates a consensus-based off-policy refinement stage, which leverages the rejection-sampled dataset to enable efficient and stable model updates. Experiments on three large language models across multiple mathematical reasoning benchmarks demonstrate that DDRL consistently outperforms existing TTRL baselines. The code will soon be released at https://github.com/yuyongcan/DDRL.
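
The first two steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, which has not been released at the time of this listing. The function names, the frequency band (`low`, `high`), and the fixed advantage values of +1 and -1 are all hypothetical choices made for illustration. The sketch also assumes that a response's "consistency" is the share of sampled responses giving the same final answer. The balancing of positive and negative examples and the off-policy refinement stage are not shown.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer, used here as the pseudo-label."""
    return Counter(answers).most_common(1)[0][0]


def answer_frequencies(answers):
    """For each response, the share of responses giving the same answer."""
    counts = Counter(answers)
    return [counts[a] / len(answers) for a in answers]


def filter_ambiguous(answers, low=0.2, high=0.5):
    """Keep indices whose answer frequency lies outside the open band (low, high).

    The band is a hypothetical stand-in for the medium-consistency
    "ambiguity region"; the paper's actual rule may differ.
    """
    freqs = answer_frequencies(answers)
    return [i for i, f in enumerate(freqs) if f <= low or f >= high]


def fixed_advantages(answers, kept, pseudo_label, pos=1.0, neg=-1.0):
    """Assign a constant advantage per kept response instead of a
    group-standardized one (the values of pos and neg are assumptions)."""
    return [pos if answers[i] == pseudo_label else neg for i in kept]


# Eight sampled final answers for one question (made-up data).
answers = ["12", "12", "12", "12", "7", "7", "7", "9"]
label = majority_vote(answers)
kept = filter_ambiguous(answers)
advantages = fixed_advantages(answers, kept, label)
```

In this toy group, the three responses answering "7" have a frequency of 0.375 and fall inside the assumed band, so they are dropped. The four majority responses and the single rare response are kept and receive advantages whose magnitude does not depend on how the votes were split.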

Problem

Research questions and friction points this paper is trying to address.

test-time reinforcement learning

spurious signal

label noise

mathematical reasoning

reward noise

Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time reinforcement learning

spurious signal mitigation

debiased advantage estimation