What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

📅 2026-03-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a critical limitation in existing test-time reinforcement learning methods, which rely on majority voting to generate positive pseudo-labels and are prone to injecting erroneous supervision signals when answer distributions are dispersed, thereby amplifying noise. To mitigate this issue, the authors propose SCRL, a novel framework that introduces negative supervision into test-time reinforcement learning for the first time. SCRL employs a dual strategy: it filters out weak-consensus samples through selective positive pseudo-labeling and dynamically discards high-uncertainty negative samples via an entropy-gated mechanism based on generation uncertainty. By integrating stringent consensus-based filtering with uncertainty-aware negative sample suppression, SCRL achieves substantial performance gains over current approaches across multiple reasoning benchmarks, demonstrating stable training and strong generalization even under limited rollout budgets.
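The selective positive pseudo-labeling step described above can be sketched as a simple consensus filter over rollout answers. This is an illustrative reconstruction, not the authors' implementation; the `min_consensus` threshold is an assumed hyperparameter.

```python
from collections import Counter

def select_positive_pseudo_label(answers, min_consensus=0.7):
    """Majority-vote pseudo-labeling with a strict consensus filter.

    answers: final answers extracted from N rollouts of one prompt.
    Returns the majority answer if its vote share clears the
    (assumed) threshold; otherwise None, i.e. the weak-consensus
    sample is filtered out instead of injecting a noisy label.
    """
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) >= min_consensus:
        return answer
    return None

# Strong consensus: 6 of 8 rollouts agree, so a pseudo-label is emitted.
print(select_positive_pseudo_label(["42"] * 6 + ["41", "40"]))  # prints: 42
# Dispersed answers: the plurality holds only 2 of 4 votes, so the sample is skipped.
print(select_positive_pseudo_label(["1", "2", "2", "3"]))  # prints: None
```

The filter only changes *which* samples receive positive rewards; rollouts from rejected samples contribute no supervision at all under this step.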

📝 Abstract
Test-Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to enhance reasoning capabilities on unlabeled test streams by deriving pseudo-rewards from majority voting consensus. However, existing TTRL methods rely exclusively on positive pseudo-labeling strategies. Such reliance becomes vulnerable under challenging scenarios where answer distributions are highly dispersed, resulting in weak consensus that inadvertently reinforces incorrect trajectories as supervision signals. In this paper, we propose SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning framework that effectively mitigates label noise amplification. SCRL develops Selective Positive Pseudo-Labeling, which enforces strict consensus criteria to filter unreliable majorities. Complementarily, SCRL introduces Entropy-Gated Negative Pseudo-Labeling, the first negative supervision mechanism in TTRL, to reliably prune incorrect trajectories based on generation uncertainty. Extensive experiments on multiple reasoning benchmarks demonstrate that SCRL achieves substantial improvements over baselines, while maintaining robust generalization and training stability under constrained rollout budgets. Our code is available at https://github.com/Jasper-Yan/SCRL.
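The entropy-gated negative pseudo-labeling described in the abstract can be illustrated with the Shannon entropy of the rollout answer distribution as an uncertainty proxy. This is a minimal sketch under assumptions: the gate value `max_entropy` and the use of answer-distribution entropy (rather than, say, token-level generation entropy) are illustrative choices, not details confirmed by the abstract.

```python
import math
from collections import Counter

def entropy_gated_negatives(answers, majority_answer, max_entropy=1.0):
    """Entropy-gated negative pseudo-labeling (illustrative sketch).

    Computes the Shannon entropy of the rollout answer distribution
    as a proxy for generation uncertainty. When entropy is below the
    gate, rollouts that disagree with the majority answer are returned
    as negative samples; when entropy is high, no negatives are
    emitted, so potentially correct minority trajectories are not
    penalized under weak consensus.
    """
    counts = Counter(answers)
    total = len(answers)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    if entropy > max_entropy:
        return []  # too uncertain: discard negative supervision entirely
    return [i for i, a in enumerate(answers) if a != majority_answer]

# Confident distribution (entropy ~0.54 bits): the lone dissenter becomes a negative.
print(entropy_gated_negatives(["42"] * 7 + ["13"], "42"))  # prints: [7]
# Fully dispersed distribution (entropy 2.0 bits): the gate suppresses all negatives.
print(entropy_gated_negatives(["1", "2", "3", "4"], "1"))  # prints: []
```

Gating on uncertainty is what distinguishes this from naive negative labeling: it only prunes trajectories when the model's answer distribution is concentrated enough for disagreement to be evidence of error.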
Problem

Research questions and friction points this paper is trying to address.

Test-Time Reinforcement Learning
Label Noise
Weak Consensus
Pseudo-Labeling
Reasoning Trajectories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Reinforcement Learning
Selective Positive Pseudo-Labeling
Negative Pseudo-Labeling
Consensus Filtering
Entropy-Gated Uncertainty
Dong Yan
AI Chief Expert, Bosch.
Reinforcement Learning · Foundation Model
Jian Liang
Kuaishou Inc.
transfer learning · graph learning
Yanbo Wang
School of Artificial Intelligence, University of Chinese Academy of Sciences; NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences
Shuo Lu
NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences
Ran He
School of Artificial Intelligence, University of Chinese Academy of Sciences; NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences
Tieniu Tan
Institute of Automation, Chinese Academy of Sciences