What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

📅 2026-03-20
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a critical limitation in existing test-time reinforcement learning methods, which rely on majority voting to generate positive pseudo-labels and are prone to injecting erroneous supervision signals when answer distributions are dispersed, thereby amplifying noise. To mitigate this issue, the authors propose SCRL, a novel framework that introduces negative supervision into test-time reinforcement learning for the first time. SCRL employs a dual strategy: it filters out weak-consensus samples through selective positive pseudo-labeling and dynamically discards high-uncertainty negative samples via an entropy-gated mechanism based on generation uncertainty. By integrating stringent consensus-based filtering with uncertainty-aware negative sample suppression, SCRL achieves substantial performance gains over current approaches across multiple reasoning benchmarks, demonstrating stable training and strong generalization even under limited rollout budgets.
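The selective positive pseudo-labeling step described above can be sketched as a simple consensus filter over rollout answers. This is an illustrative reconstruction, not the authors' implementation; the `min_consensus` threshold is an assumed hyperparameter.

```python
from collections import Counter

def select_positive_pseudo_label(answers, min_consensus=0.7):
    """Majority-vote pseudo-labeling with a strict consensus filter.

    answers: final answers extracted from N rollouts of one prompt.
    Returns the majority answer if its vote share clears the
    (assumed) threshold; otherwise None, i.e. the weak-consensus
    sample is filtered out instead of injecting a noisy label.
    """
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) >= min_consensus:
        return answer
    return None

# Strong consensus: 6 of 8 rollouts agree, so a pseudo-label is emitted.
print(select_positive_pseudo_label(["42"] * 6 + ["41", "40"]))  # prints: 42
# Dispersed answers: the plurality holds only 2 of 4 votes, so the sample is skipped.
print(select_positive_pseudo_label(["1", "2", "2", "3"]))  # prints: None
```

The filter only changes *which* samples receive positive rewards; rollouts from rejected samples contribute no supervision at all under this step.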

📝 Abstract
Test-Time Reinforcement Learning (TTRL) enables Large Language Models (LLMs) to enhance reasoning capabilities on unlabeled test streams by deriving pseudo-rewards from majority voting consensus. However, existing TTRL methods rely exclusively on positive pseudo-labeling strategies. Such reliance becomes vulnerable under challenging scenarios where answer distributions are highly dispersed, resulting in weak consensus that inadvertently reinforces incorrect trajectories as supervision signals. In this paper, we propose SCRL (Selective-Complementary Reinforcement Learning), a robust test-time reinforcement learning framework that effectively mitigates label noise amplification. SCRL develops Selective Positive Pseudo-Labeling, which enforces strict consensus criteria to filter unreliable majorities. Complementarily, SCRL introduces Entropy-Gated Negative Pseudo-Labeling, the first negative supervision mechanism in TTRL, to reliably prune incorrect trajectories based on generation uncertainty. Extensive experiments on multiple reasoning benchmarks demonstrate that SCRL achieves substantial improvements over baselines, while maintaining robust generalization and training stability under constrained rollout budgets. Our code is available at https://github.com/Jasper-Yan/SCRL.
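The entropy-gated negative pseudo-labeling described in the abstract can be illustrated with the Shannon entropy of the rollout answer distribution as an uncertainty proxy. This is a minimal sketch under assumptions: the gate value `max_entropy` and the use of answer-distribution entropy (rather than, say, token-level generation entropy) are illustrative choices, not details confirmed by the abstract.

```python
import math
from collections import Counter

def entropy_gated_negatives(answers, majority_answer, max_entropy=1.0):
    """Entropy-gated negative pseudo-labeling (illustrative sketch).

    Computes the Shannon entropy of the rollout answer distribution
    as a proxy for generation uncertainty. When entropy is below the
    gate, rollouts that disagree with the majority answer are returned
    as negative samples; when entropy is high, no negatives are
    emitted, so potentially correct minority trajectories are not
    penalized under weak consensus.
    """
    counts = Counter(answers)
    total = len(answers)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    if entropy > max_entropy:
        return []  # too uncertain: discard negative supervision entirely
    return [i for i, a in enumerate(answers) if a != majority_answer]

# Confident distribution (entropy ~0.54 bits): the lone dissenter becomes a negative.
print(entropy_gated_negatives(["42"] * 7 + ["13"], "42"))  # prints: [7]
# Fully dispersed distribution (entropy 2.0 bits): the gate suppresses all negatives.
print(entropy_gated_negatives(["1", "2", "3", "4"], "1"))  # prints: []
```

Gating on uncertainty is what distinguishes this from naive negative labeling: it only prunes trajectories when the model's answer distribution is concentrated enough for disagreement to be evidence of error.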
Problem

Research questions and friction points this paper is trying to address.

Test-Time Reinforcement Learning
Label Noise
Weak Consensus
Pseudo-Labeling
Reasoning Trajectories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-Time Reinforcement Learning
Selective Positive Pseudo-Labeling
Negative Pseudo-Labeling
Consensus Filtering
Entropy-Gated Uncertainty
Dong Yan
AI Chief Expert, Bosch.
Reinforcement Learning · Foundation Model
Jian Liang
Kuaishou Inc.
transfer learning · graph learning
Yanbo Wang
School of Artificial Intelligence, University of Chinese Academy of Sciences; NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences
Shuo Lu
NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences
Ran He
School of Artificial Intelligence, University of Chinese Academy of Sciences; NLPR & MAIS, Institute of Automation, Chinese Academy of Sciences
Tieniu Tan
Institute of Automation, Chinese Academy of Sciences