How Far Can Unsupervised RLVR Scale LLM Training?

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the efficacy and limitations of unsupervised reinforcement learning with verifiable rewards (URLVR) for training large language models. Addressing the challenge of constructing reward signals without ground-truth labels, the work proposes a unified theoretical framework showing that intrinsic reward mechanisms essentially sharpen the model's initial distribution, so performance hinges on the alignment between the model's initial confidence and correctness. The paper introduces the "Model Collapse Step" as a practical metric for assessing the compatibility between a model's prior and its trainability under reinforcement learning, observing a common rise-then-fall performance pattern across intrinsic URLVR methods. It also explores an external reward approach grounded in computational asymmetry, providing preliminary evidence that it can surpass the confidence-correctness ceiling inherent to intrinsic methods.
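
To make the sharpening claim concrete, here is a minimal sketch (illustrative only, not the paper's exact formulation) of one common intrinsic reward, majority-vote self-consistency: with no ground-truth label, each sampled answer is rewarded for agreeing with the group's modal answer, so reinforcement can only push probability mass toward what the model already believes.

```python
from collections import Counter

def intrinsic_majority_reward(answers: list[str]) -> list[float]:
    """Reward each sampled answer 1.0 iff it matches the group's modal answer."""
    mode, _ = Counter(answers).most_common(1)[0]
    return [1.0 if ans == mode else 0.0 for ans in answers]

# The model's prior puts most mass on "42", so "42" is reinforced whether or
# not it is actually correct: the reward sharpens the initial distribution,
# which is exactly the failure mode when confidence and correctness misalign.
print(intrinsic_majority_reward(["42", "42", "17", "42"]))  # [1.0, 1.0, 0.0, 1.0]
```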

📝 Abstract
Unsupervised reinforcement learning with verifiable rewards (URLVR) offers a pathway to scale LLM training beyond the supervision bottleneck by deriving rewards without ground-truth labels. Recent works leverage model-intrinsic signals, showing promising early gains, yet their potential and limitations remain unclear. In this work, we revisit URLVR and provide a comprehensive analysis spanning taxonomy, theory, and extensive experiments. We first classify URLVR methods into intrinsic versus external based on their reward sources, then establish a unified theoretical framework revealing that all intrinsic methods converge toward sharpening the model's initial distribution. This sharpening mechanism succeeds when initial confidence aligns with correctness but fails catastrophically when misaligned. Through systematic experiments, we show intrinsic rewards consistently follow a rise-then-fall pattern across methods, with collapse timing determined by the model prior rather than engineering choices. Despite these scaling limits, we find intrinsic rewards remain valuable in test-time training on small datasets, and propose the Model Collapse Step to measure the model prior, serving as a practical indicator for RL trainability. Finally, we explore external reward methods that ground verification in computational asymmetries, showing preliminary evidence they may escape the confidence-correctness ceiling. Our findings chart boundaries for intrinsic URLVR while motivating paths toward scalable alternatives.
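
The abstract does not spell out how the Model Collapse Step is computed. A plausible proxy, assuming it marks the peak of the rise-then-fall evaluation curve before sustained degradation, could look like this sketch (function and variable names are assumptions, not the paper's API):

```python
import numpy as np

def model_collapse_step(eval_scores: list[float], window: int = 3) -> int:
    """Step index where a smoothed rise-then-fall evaluation curve peaks."""
    smoothed = np.convolve(eval_scores, np.ones(window) / window, mode="valid")
    return int(np.argmax(smoothed)) + window // 2  # re-center on original steps

# Accuracy rises while sharpening helps, then degrades once the prior's
# misaligned mass dominates; the peak (step 4 here) is the collapse step.
curve = [0.40, 0.48, 0.55, 0.61, 0.63, 0.60, 0.52, 0.41, 0.30]
print(model_collapse_step(curve))  # 4
```

Under this reading, evaluating checkpoints on a held-out set and locating the peak would give the indicator the paper describes: a later collapse step suggests a prior better aligned with correctness, and hence greater RL trainability.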
Problem

Research questions and friction points this paper is trying to address.

Unsupervised Reinforcement Learning
Verifiable Rewards
Large Language Models
Model Collapse
Reward Scaling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unsupervised Reinforcement Learning
Verifiable Rewards
Model Collapse
Reward Sharpening
Computational Asymmetry
Bingxiang He
Second-year PhD candidate, Tsinghua University
Natural Language Processing
Yuxin Zuo
Tsinghua University, Shanghai AI Lab
Zeyuan Liu
Tsinghua University
Shangziqi Zhao
Xi’an Jiaotong University
Zixuan Fu
Nanyang Technological University
Image Restoration, Generative Models, Low-level Vision
Junlin Yang
Department of Computer Science and Technology, Tsinghua University
Natural Language Processing, Machine Learning
Cheng Qian
University of Illinois Urbana-Champaign
Tool Learning, Agent
Kaiyan Zhang
Tsinghua University
Foundation Model, Collective Intelligence, Scientific Intelligence
Yuchen Fan
Shanghai AI Laboratory & Shanghai Jiao Tong University
NLP, Large Language Models, Evaluation
Ganqu Cui
Shanghai AI Lab
LLM Alignment, Reinforcement Learning
Xiusi Chen
Postdoctoral Fellow, University of Illinois Urbana-Champaign
Language Models, Neuro-Symbolic AI, Reasoning and Planning, LLM Alignment
Youbang Sun
Assistant Researcher, Tsinghua University; Northeastern University; Texas A&M University
Distributed Optimization, Multi-Agent RL, Riemannian Optimization, Federated Learning
Xingtai Lv
Tsinghua University
Large Language Model, Natural Language Processing
Xuekai Zhu
Shanghai Jiao Tong University
Synthetic Data, Reasoning, Language Model
Li Sheng
Tsinghua University
Ran Li
Tsinghua University
Huan-ang Gao
Ph.D. student, Tsinghua University
Agent, Vision & Robotics
Yuchen Zhang
Peking University
Large Language Models, Reinforcement Learning, Efficient Deep Learning
Bowen Zhou
Chair Professor, Department of Electrical Engineering, Tsinghua University; Founder of Frontis.ai
Machine Learning, Natural Language Processing, Representation Learning and Reasoning, Conversational AI
Zhiyuan Liu
Tsinghua University
Autonomous Driving, Traffic Simulation
Ning Ding
Assistant Professor, Tsinghua University
Natural Language Processing, Machine Learning