🤖 AI Summary
Existing reinforcement learning approaches for language models rely heavily on human-annotated preference data for reward modeling, severely limiting scalability. This paper proposes COMPASS, a test-time self-supervised RL framework that enables autonomous learning from continuous experience streams without external supervision. COMPASS introduces a dual self-rewarding mechanism: (i) the Dual-Calibration Answer Reward (DCAR), which jointly calibrates model confidence and credibility while performing self-consistency analysis; and (ii) the Decisive Path Reward (DPR), which explicitly models reasoning chain quality to optimize the reasoning process, not just the final output. By avoiding the pseudo-label entrenchment induced by majority voting, COMPASS achieves, for the first time, joint self-evaluation of both reasoning outcomes and intermediate steps. Extensive experiments across diverse tasks and models demonstrate that COMPASS consistently and significantly improves complex reasoning performance, validating its effectiveness and scalability in continual learning settings.
📝 Abstract
Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs), achieving remarkable performance in complex reasoning domains such as mathematics and code generation. However, current RL methods face a fundamental scalability bottleneck due to their heavy reliance on human-curated preference data or labeled datasets for reward modeling. To overcome this limitation, we explore RL on unlabeled data, where models learn autonomously from continuous experience streams. The core challenge in this setting lies in reliable reward estimation without ground-truth supervision. Existing approaches such as Test-Time RL address this through self-consistent consensus, but risk reinforcing incorrect pseudo-labels derived from majority voting. We introduce COMPASS (Composite Path and Answer Self-Scoring), a novel test-time reward mechanism that operates without external supervision. COMPASS integrates two complementary components: the Dual-Calibration Answer Reward (DCAR), which stabilizes training by establishing trustworthy pseudo-labels through confidence and credibility calibration, and the Decisive Path Reward (DPR), which directly optimizes reasoning process quality beyond mere outcome supervision. By jointly reinforcing trustworthy consensus answers and highly decisive reasoning chains, COMPASS systematically enhances the model's analytical capabilities. Extensive experiments show that COMPASS achieves significant and consistent performance gains across diverse reasoning tasks and model architectures, advancing a more scalable direction for LLMs to learn from continuous experience.
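To make the DCAR idea concrete, here is a minimal, hypothetical sketch of a confidence-calibrated consensus reward. The paper does not specify its exact formulas; the function name, the `majority_threshold` parameter, and the weighting scheme below are illustrative assumptions. The sketch shows the key safeguard the abstract describes: a pseudo-label from majority voting is only reinforced when its confidence-weighted vote share is credible enough, which avoids entrenching a dubious consensus.

```python
from collections import Counter


def dcar_reward(answers, confidences, majority_threshold=0.5):
    """Hypothetical sketch of a DCAR-style self-reward (not the paper's exact rule).

    answers: final answers sampled from the model for one question.
    confidences: per-sample self-reported confidences in [0, 1].
    Returns (pseudo_label, rewards): the consensus answer plus a per-sample
    reward that is nonzero only when the sample agrees with a trustworthy
    consensus.
    """
    # Confidence-weighted vote share for each candidate answer.
    weights = Counter()
    for ans, conf in zip(answers, confidences):
        weights[ans] += conf
    total = sum(weights.values()) or 1.0

    pseudo_label, top = max(weights.items(), key=lambda kv: kv[1])
    credibility = top / total  # how dominant the consensus answer is

    # Credibility calibration: if no answer dominates, withhold reward
    # instead of reinforcing a potentially wrong pseudo-label.
    if credibility < majority_threshold:
        return pseudo_label, [0.0] * len(answers)

    # Confidence calibration: agreeing samples are rewarded in proportion
    # to their own stated confidence.
    return pseudo_label, [
        conf if ans == pseudo_label else 0.0
        for ans, conf in zip(answers, confidences)
    ]


if __name__ == "__main__":
    label, rewards = dcar_reward(
        ["42", "42", "17", "42"], [0.9, 0.8, 0.4, 0.7]
    )
    print(label, rewards)  # consensus "42"; the dissenting sample gets 0.0
```

A DPR-style term would additionally score the reasoning chain itself (e.g. how decisively each step narrows toward the answer) rather than only the final answer; that component is omitted here since its scoring function is specific to the paper.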