🤖 AI Summary
This work addresses a critical limitation of existing test-time reinforcement learning methods, which rely on majority voting to generate positive pseudo-labels and are prone to injecting erroneous supervision when answer distributions are dispersed, thereby amplifying label noise. To mitigate this, the authors propose SCRL, a framework that introduces negative supervision into test-time reinforcement learning for the first time. SCRL employs a dual strategy: it filters out weak-consensus samples through selective positive pseudo-labeling, and it discards unreliable negative samples via an entropy gate on generation uncertainty. By combining strict consensus-based filtering with uncertainty-aware suppression of negative samples, SCRL achieves substantial gains over current approaches across multiple reasoning benchmarks, demonstrating stable training and strong generalization even under limited rollout budgets.
📝 Abstract
Test-Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to enhance reasoning capabilities on unlabeled test streams by deriving pseudo-rewards from majority-voting consensus. However, existing TTRL methods rely exclusively on positive pseudo-labeling. This reliance becomes fragile in challenging scenarios where answer distributions are highly dispersed: the resulting weak consensus can cause incorrect trajectories to be reinforced as supervision signals. In this paper, we propose SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning framework that effectively mitigates label noise amplification. SCRL develops Selective Positive Pseudo-Labeling, which enforces strict consensus criteria to filter unreliable majorities. Complementarily, SCRL introduces Entropy-Gated Negative Pseudo-Labeling, the first negative supervision mechanism in TTRL, to reliably prune incorrect trajectories based on generation uncertainty. Extensive experiments on multiple reasoning benchmarks demonstrate that SCRL achieves substantial improvements over baselines, while maintaining robust generalization and training stability under constrained rollout budgets. Our code is available at https://github.com/Jasper-Yan/SCRL.
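The two mechanisms described above — strict consensus filtering for positive pseudo-labels and an entropy gate that drops negatives when generation uncertainty is high — can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the function name `select_pseudo_labels` and the thresholds `tau_pos` and `tau_ent` are assumptions chosen for the example.

```python
# Hedged sketch of SCRL-style pseudo-label selection over one prompt's
# rollouts. Names and thresholds (tau_pos, tau_ent) are illustrative
# assumptions, not values from the paper.
import math
from collections import Counter


def answer_entropy(counts: Counter, n: int) -> float:
    """Shannon entropy (nats) of the empirical answer distribution."""
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def select_pseudo_labels(answers, tau_pos=0.6, tau_ent=1.0):
    """Given final answers extracted from N rollouts of one prompt,
    return (positive_label_or_None, kept_negative_answers).

    Positive: the majority answer, kept only if its vote share reaches
    tau_pos (strict consensus filter; weak majorities yield no label).
    Negative: non-majority answers, kept only if the answer distribution's
    entropy is at most tau_ent (when generation uncertainty is high,
    negatives are considered unreliable and discarded)."""
    counts = Counter(answers)
    n = len(answers)
    majority, votes = counts.most_common(1)[0]

    positive = majority if votes / n >= tau_pos else None
    if answer_entropy(counts, n) <= tau_ent:
        negatives = [a for a in answers if a != majority]
    else:
        negatives = []  # entropy gate closed: drop all negative samples
    return positive, negatives


# Strong consensus, low entropy: both positive and negatives survive.
pos, neg = select_pseudo_labels(["42"] * 7 + ["41", "40", "43"])
# Dispersed answers: no positive label, entropy gate discards negatives.
pos2, neg2 = select_pseudo_labels(["a", "b", "c", "d", "e"] * 2)
```

In the first call the majority answer "42" holds a 0.7 vote share, so it passes the consensus filter, and the distribution's entropy (about 0.94 nats) is below the gate, so the three minority answers are kept as negatives; in the second call every answer ties at 0.2 share and the entropy (ln 5 ≈ 1.61) exceeds the gate, so no supervision signal is emitted at all.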