Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional Pass@K evaluation for Reinforcement Learning with Verifiable Rewards (RLVR) assesses only final-answer correctness and ignores whether the reasoning process is reliable, leading to misleading assessments of LLMs’ reasoning capabilities. Method: We propose CoT-Pass@K, an evaluation paradigm that requires both a correct chain-of-thought (CoT) reasoning trace and a correct final answer. We formally characterize RLVR’s intrinsic incentive structure toward logical completeness, and we combine a training-dynamics analysis with theoretical reasoning to examine how correct reasoning generalizes early in training. Results: Our analysis reveals that RLVR induces generalizable, logically sound reasoning as early as the initial training phase, and empirically RLVR consistently and significantly improves CoT-Pass@K across all values of K. This work establishes both a theoretically grounded evaluation framework and a practical metric for assessing and optimizing reasoning reliability in LLMs.
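
To make the metric concrete, here is a minimal Python sketch of how Pass@K and CoT-Pass@K can be computed from n sampled generations, assuming the standard unbiased Pass@K estimator (Chen et al., 2021) and an externally supplied count c_cot of samples whose CoT and final answer are both judged correct. The function names and example numbers are illustrative, not from the paper.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator (Chen et al., 2021): probability that at least
    # one of k samples drawn without replacement from n generations,
    # c of which are correct, is correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def cot_pass_at_k(n: int, c_cot: int, k: int) -> float:
    # CoT-Pass@K: identical estimator, but a generation counts as correct
    # only if BOTH its chain of thought and its final answer are judged
    # correct, so c_cot <= c. How that judgment is produced (e.g., by a
    # verifier model) is an assumption of this sketch, not specified here.
    return pass_at_k(n, c_cot, k)

# Illustrative numbers (not from the paper): 16 samples, 8 correct final
# answers, but only 5 of those also have a logically sound CoT.
print(pass_at_k(16, 8, 4))      # ≈ 0.962
print(cot_pass_at_k(16, 5, 4))  # ≈ 0.819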

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for advancing the reasoning capabilities of Large Language Models (LLMs). However, a critical paradox clouds its efficacy: RLVR-tuned models often underperform their base models on the $Pass@K$ metric for solution-finding, leading to the hypothesis that RLVR merely re-weights existing reasoning paths at the cost of reasoning diversity. In this work, we resolve this contradiction by identifying the source of the problem: the $Pass@K$ metric itself is a flawed measure of reasoning, as it credits correct final answers that probably arise from inaccurate or incomplete chains of thought (CoTs). To address this, we introduce a more precise evaluation metric, $CoT$-$Pass@K$, which mandates that both the reasoning path and the final answer be correct. We provide a new theoretical foundation that formalizes how RLVR, unlike traditional RL, is uniquely structured to incentivize logical integrity. Our empirical results are supportive: using $CoT$-$Pass@K$, we observe that RLVR can incentivize the generalization of correct reasoning for all values of $K$. Furthermore, by analyzing the training dynamics, we find that this enhanced reasoning capability emerges early in the training process and smoothly generalizes. Our work provides a clear perspective on the role of RLVR, offers a more reliable method for its evaluation, and confirms its potential to genuinely advance machine reasoning.
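
As a plausible formalization (the paper’s exact notation may differ): if $n$ completions are sampled per problem, $c$ of them reach a correct final answer, and $c_{cot} \le c$ of those also follow a correct reasoning path, the unbiased estimators are $\widehat{Pass@K} = 1 - \binom{n-c}{K}/\binom{n}{K}$ and $\widehat{CoT\text{-}Pass@K} = 1 - \binom{n-c_{cot}}{K}/\binom{n}{K}$, i.e., the same estimator as in the sketch above with $c$ replaced by $c_{cot}$. Because $c_{cot} \le c$, CoT-Pass@K never exceeds Pass@K for the same model, which is how a base model can lead on Pass@K through lucky final answers with flawed CoTs yet trail on CoT-Pass@K.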
Problem

Research questions and friction points this paper is trying to address.

RLVR-tuned models underperform their base models on the Pass@K metric
Pass@K is flawed: it credits correct final answers reached through incorrect or incomplete reasoning
Need for a metric, CoT-Pass@K, that requires both the reasoning path and the final answer to be correct
Innovation

Methods, ideas, or system contributions that make the work stand out.

Shows that RLVR’s verifiable rewards implicitly incentivize correct reasoning, not just correct answers
Introduces CoT-Pass@K, which credits a sample only when both its CoT and its final answer are correct
Provides a theoretical foundation for why RLVR, unlike traditional RL, incentivizes logical integrity
👥 Authors
Xumeng Wen (Microsoft Research Asia)
Zihan Liu (Peking University)
Shun Zheng (Microsoft Research Asia)
Zhijian Xu (University of Science and Technology of China)
Shengyu Ye (Microsoft Research Asia)
Zhirong Wu (Microsoft Research Asia)
Xiao Liang (University of California, Los Angeles)
Yang Wang (Microsoft Research Asia)
Junjie Li (Microsoft Research Asia)
Ziming Miao (Microsoft Research)
Jiang Bian (Microsoft Research Asia)
Mao Yang (Microsoft Research Asia)