Distribution-Aware Reward Estimation for Test-Time Reinforcement Learning

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a critical limitation of existing test-time reinforcement learning methods: they rely on majority voting to estimate rewards, overlook non-majority yet correct actions, and thus produce biased estimates and lose information. To overcome this, the authors propose Distribution-Aware Reward Estimation (DARE), which extends reward modeling from point estimates to the full empirical distribution of trajectory rewards. DARE further incorporates an exploration bonus and a distribution pruning mechanism to make the reward signal more informative and robust. By moving beyond majority voting, DARE markedly improves optimization stability and achieves relative gains of 25.3% on the AIME 2024 and 5.3% on the AMC reasoning benchmarks.
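
The summary contrasts DARE's distribution-based reward with majority voting but gives no formulas. The minimal Python sketch below only illustrates that contrast on toy data; the function names (mv_reward, distribution_reward) and the 0/1 voting scheme are illustrative assumptions, not the paper's definitions.

```python
from collections import Counter

def mv_reward(rollout_answers, candidate):
    """1 if candidate equals the single most frequent rollout answer,
    else 0 -- the point estimate that DARE moves away from."""
    majority, _ = Counter(rollout_answers).most_common(1)[0]
    return 1.0 if candidate == majority else 0.0

def distribution_reward(rollout_answers, candidate):
    """Empirical frequency of the candidate among rollouts, so
    non-majority but possibly correct answers keep a graded signal."""
    return Counter(rollout_answers)[candidate] / len(rollout_answers)

rollouts = ["42", "42", "17", "42", "17", "9"]
print(mv_reward(rollouts, "17"))            # 0.0 -- MV discards it
print(distribution_reward(rollouts, "17"))  # ~0.33 -- graded signal
```

On the toy rollouts, majority voting assigns "17" zero reward, while the distribution-based estimate preserves its one-third empirical support, which is the information loss the summary describes.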

📝 Abstract
Test-time reinforcement learning (TTRL) enables large language models (LLMs) to self-improve on unlabeled inputs, but its effectiveness critically depends on how reward signals are estimated without ground-truth supervision. Most existing TTRL methods rely on majority voting (MV) over rollouts to produce deterministic rewards, implicitly assuming that the majority rollout provides a reliable learning signal. We show that this assumption is fragile: MV collapses the rollout distribution into a single outcome, discarding information about non-majority but correct candidates and yielding systematically biased reward estimates. To address this, we propose Distribution-Aware Reward Estimation (DARE), which shifts reward estimation from a single majority outcome to the full empirical rollout distribution. DARE further augments this distribution-based reward with an exploration bonus and a distribution pruning mechanism, encouraging exploration of non-majority rollouts and denoising the reward, which yields a more informative and robust reward estimate. Extensive experiments on challenging reasoning benchmarks show that DARE improves optimization stability and final performance over recent baselines, achieving relative improvements of 25.3% on AIME 2024 and 5.3% on AMC.
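
The abstract names an exploration bonus and a distribution pruning mechanism but does not reproduce their formulas, so the sketch below is one plausible reading under stated assumptions: pruning drops low-frequency answers as noise, and the bonus mildly favors rarer surviving answers. The name dare_style_reward, the prune_frac and bonus_coef hyperparameters, and the sqrt-log bonus form are all illustrative, not the authors' definitions.

```python
from collections import Counter
import math

def dare_style_reward(rollout_answers, candidate,
                      bonus_coef=0.1, prune_frac=0.05):
    """Distribution-based reward with an exploration bonus and
    distribution pruning. Functional forms and hyperparameters are
    assumptions for illustration, not the paper's definitions."""
    n = len(rollout_answers)
    counts = Counter(rollout_answers)
    # Distribution pruning: discard answers rarer than prune_frac of
    # all rollouts, treating the long tail as reward noise.
    kept = {a: c for a, c in counts.items() if c / n >= prune_frac}
    total = sum(kept.values())
    if candidate not in kept:
        return 0.0
    p = kept[candidate] / total
    # Exploration bonus: give surviving non-majority answers a small
    # extra reward that grows as their empirical frequency shrinks.
    return p + bonus_coef * math.sqrt(-math.log(p))
```

With this reading, a candidate that survives pruning earns its renormalized empirical frequency plus a bonus that vanishes for the unanimous answer (p = 1) and grows for rarer answers, matching the abstract's goal of exploring non-majority rollouts without letting noise dominate.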
Problem

Research questions and friction points this paper is trying to address.

Test-time reinforcement learning
Reward estimation
Distribution-aware
Majority voting
Unlabeled inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distribution-Aware Reward Estimation
Test-Time Reinforcement Learning
Reward Estimation
Exploration Bonus
Distribution Pruning
Bodong Du
Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
Xuanqi Huang
Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China
Xiaomeng Li
Assistant Professor, The Hong Kong University of Science and Technology
Medical Image Analysis · AI in Healthcare · Deep Learning