Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

📅 2025-04-18

🤖 AI Summary

This work investigates whether Reinforcement Learning with Verifiable Rewards (RLVR) genuinely expands the reasoning capabilities of Large Language Models (LLMs) beyond those of their base models. Method: The authors measure pass@k at large values of k across multiple model families and benchmarks, analyze the distribution of reasoning paths, and contrast RLVR with knowledge distillation. Contribution/Results: RLVR does not induce novel reasoning patterns; the reasoning paths generated by RL-trained models are already present in the base model's sampling distribution. Instead, RLVR improves sampling efficiency at the cost of a narrower reasoning capability boundary. RL-trained models win at small k (e.g., k=1), but at large k base models reach comparable or even higher pass@k, which points to compression of the output distribution rather than expansion. In contrast, distillation can introduce knowledge absent from the base model. The study provides systematic evidence of a critical limitation of RLVR for advancing LLM reasoning, challenges the belief that RLVR lets models acquire new reasoning abilities, and calls for rethinking current training paradigms for reasoning LLMs.

📝 Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs, particularly in mathematics and programming tasks. It is widely believed that RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed the capacity of the corresponding base models. In this study, however, we critically re-examine this assumption by measuring the pass@$k$ metric with large values of $k$ to explore the reasoning capability boundary of the models across a wide range of model families and benchmarks. Surprisingly, RL does not, in fact, elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at smaller values of $k$ (e.g., $k=1$), base models can achieve a comparable or even higher pass@$k$ score than their RL counterparts at large $k$ values. The reasoning paths generated by RL-trained models are already included in the base models' sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already obtained by base models. Further analysis shows that RL training boosts performance by biasing the model's output distribution toward paths that are more likely to yield rewards, thereby sampling correct responses more efficiently. But this also results in a narrower reasoning capability boundary compared to base models. Similar results are observed in visual reasoning tasks trained with RLVR. Moreover, we find that distillation can genuinely introduce new knowledge into the model, unlike RLVR. These findings underscore a critical limitation of RLVR in advancing LLM reasoning abilities, which requires us to fundamentally rethink the impact of RL training in reasoning LLMs and the need for a better paradigm. Project Page: https://limit-of-RLVR.github.io
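Since the argument rests on pass@k at large k, a minimal sketch of how that metric can be estimated from a finite batch of samples may help. It assumes the standard unbiased combinatorial estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. The abstract does not say which estimator the authors use, so this choice, and the function names `pass_at_k` and `mean_pass_at_k`, are assumptions for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    Assumed estimator: 1 - C(n - c, k) / C(n, k), i.e. one minus the
    probability that a random size-k subset of the n samples has no
    correct answer.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With one correct sample out of four, `pass_at_k(4, 1, 2)` counts the size-2 subsets that miss the correct sample (3 of 6) and returns one minus that fraction. A problem with zero correct samples scores 0 at every k, which is why large-k pass@k probes the boundary of what a model can solve at all.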

Problem

Research questions and friction points this paper is trying to address.

Does RLVR truly enhance LLM reasoning beyond base models?

RLVR biases output but doesn't create new reasoning patterns

Base models match RL performance at high pass@k values
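The crossover described in the points above can be illustrated with a toy calculation. The sketch assumes independent samples, so a problem with per-sample success probability p is solved within k attempts with probability 1 - (1 - p)^k. The probabilities below are invented for illustration and are not results from the paper: the hypothetical base model has a small chance on every problem, while the hypothetical RL-trained model is very reliable on half of the problems and never solves the rest.

```python
def expected_pass_at_k(success_probs: list[float], k: int) -> float:
    """Expected pass@k over problems, assuming independent samples per problem."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical per-problem success probabilities (illustrative only).
base_model = [0.05, 0.05, 0.05, 0.05]  # wide coverage, low efficiency
rl_model = [0.90, 0.90, 0.00, 0.00]    # narrow coverage, high efficiency

for k in (1, 16, 256):
    base_score = expected_pass_at_k(base_model, k)
    rl_score = expected_pass_at_k(rl_model, k)
    print(f"k={k:>3}  base={base_score:.3f}  rl={rl_score:.3f}")
```

Under these made-up numbers the RL-style distribution is ahead at k=1 but is capped at 0.5 for any k, because two problems have zero success probability. The base-style distribution keeps climbing toward 1.0 as k grows, which mirrors the narrower capability boundary the paper reports.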

Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-k pass@k evaluation probes the reasoning capability boundary across model families and benchmarks

RL biases output to reward-yielding paths

Distillation introduces new knowledge unlike RLVR
