🤖 AI Summary
Existing reinforcement learning approaches for language models rely heavily on human-annotated preference data for reward modeling, severely limiting scalability. This paper proposes COMPASS, a test-time self-supervised RL framework that enables autonomous learning from continuous experience streams without external supervision. COMPASS introduces a dual self-rewarding mechanism: (i) the Dual-Calibration Answer Reward (DCAR), which jointly calibrates model confidence and credibility while performing self-consistency analysis; and (ii) the Decisive Path Reward (DPR), which explicitly models reasoning chain quality to optimize the reasoning process, not just the final output. By avoiding the pseudo-label entrenchment induced by majority voting, COMPASS achieves, for the first time, joint self-evaluation of both reasoning outcomes and intermediate steps. Extensive experiments across diverse tasks and models demonstrate that COMPASS consistently and significantly improves complex reasoning performance, validating its effectiveness and scalability in continual learning settings.
📝 Abstract
Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs), achieving remarkable performance in complex reasoning domains such as mathematics and code generation. However, current RL methods face a fundamental scalability bottleneck due to their heavy reliance on human-curated preference data or labeled datasets for reward modeling. To overcome this limitation, we explore RL on unlabeled data, where models learn autonomously from continuous experience streams. The core challenge in this setting lies in reliable reward estimation without ground-truth supervision. Existing approaches such as Test-Time RL address this through self-consistent consensus, but risk reinforcing incorrect pseudo-labels derived from majority voting. We introduce COMPASS (Composite Path and Answer Self-Scoring), a novel test-time reward mechanism that operates without external supervision. COMPASS integrates two complementary components: the Dual-Calibration Answer Reward (DCAR), which stabilizes training by establishing trustworthy pseudo-labels through confidence and credibility calibration, and the Decisive Path Reward (DPR), which directly optimizes reasoning process quality beyond mere outcome supervision. By jointly reinforcing trustworthy consensus answers and highly decisive reasoning chains, COMPASS systematically enhances the model's analytical capabilities. Extensive experiments show that COMPASS achieves significant and consistent performance gains across diverse reasoning tasks and model architectures, advancing a more scalable direction for LLMs to learn from continuous experience.
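To make the DCAR idea concrete, here is a minimal, hypothetical sketch of a confidence-calibrated consensus reward. The paper does not specify its exact formulas; the function name, the `majority_threshold` parameter, and the weighting scheme below are illustrative assumptions. The sketch shows the key safeguard the abstract describes: a pseudo-label from majority voting is only reinforced when its confidence-weighted vote share is credible enough, which avoids entrenching a dubious consensus.

```python
from collections import Counter


def dcar_reward(answers, confidences, majority_threshold=0.5):
    """Hypothetical sketch of a DCAR-style self-reward (not the paper's exact rule).

    answers: final answers sampled from the model for one question.
    confidences: per-sample self-reported confidences in [0, 1].
    Returns (pseudo_label, rewards): the consensus answer plus a per-sample
    reward that is nonzero only when the sample agrees with a trustworthy
    consensus.
    """
    # Confidence-weighted vote share for each candidate answer.
    weights = Counter()
    for ans, conf in zip(answers, confidences):
        weights[ans] += conf
    total = sum(weights.values()) or 1.0

    pseudo_label, top = max(weights.items(), key=lambda kv: kv[1])
    credibility = top / total  # how dominant the consensus answer is

    # Credibility calibration: if no answer dominates, withhold reward
    # instead of reinforcing a potentially wrong pseudo-label.
    if credibility < majority_threshold:
        return pseudo_label, [0.0] * len(answers)

    # Confidence calibration: agreeing samples are rewarded in proportion
    # to their own stated confidence.
    return pseudo_label, [
        conf if ans == pseudo_label else 0.0
        for ans, conf in zip(answers, confidences)
    ]


if __name__ == "__main__":
    label, rewards = dcar_reward(
        ["42", "42", "17", "42"], [0.9, 0.8, 0.4, 0.7]
    )
    print(label, rewards)  # consensus "42"; the dissenting sample gets 0.0
```

A DPR-style term would additionally score the reasoning chain itself (e.g. how decisively each step narrows toward the answer) rather than only the final answer; that component is omitted here since its scoring function is specific to the paper.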