🤖 AI Summary
This work investigates whether Reinforcement Learning with Verifiable Rewards (RLVR) can genuinely expand the reasoning capabilities of Large Language Models (LLMs). Method: We conduct a systematic evaluation via pass@k analysis at large k, comparisons across architectures and benchmarks, modeling of reasoning path distributions, and controlled ablation studies contrasting RLVR with knowledge distillation. Contribution/Results: We find that RLVR does not induce novel reasoning patterns: the reasoning paths generated by the RL-tuned model are already present in the base model’s output distribution. Instead, RLVR improves sampling efficiency at the cost of reduced reasoning diversity. RL-tuned models lead at small k, but as k increases their base counterparts match or surpass them, indicating that RLVR compresses the output distribution rather than expanding it. In contrast, knowledge distillation successfully introduces reasoning behaviors absent in the base model. This study provides the first systematic evidence of RLVR’s fundamental limitation in enhancing LLM reasoning, challenging the “capability emergence” hypothesis and prompting critical reevaluation of prevailing LLM reasoning augmentation paradigms.
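The crossover between RL-tuned and base models at large k can be illustrated with a toy calculation. The sketch below assumes each problem has a fixed per-sample success probability p, so that pass@k for that problem is 1 - (1 - p)^k. The function name and all probabilities are hypothetical, chosen only for illustration; they are not results from the study.

```python
def pass_at_k(success_probs, k):
    """Mean over problems of P(at least one of k independent samples is correct)."""
    return sum(1.0 - (1.0 - p) ** k for p in success_probs) / len(success_probs)


# Hypothetical per-problem success probabilities (illustrative only).
# The "base" model solves every problem occasionally; the "RL" model is far
# more reliable on three problems but never solves the fourth.
base_probs = [0.05, 0.05, 0.05, 0.05]
rl_probs = [0.60, 0.60, 0.60, 0.00]

for k in (1, 16, 256):
    base_score = pass_at_k(base_probs, k)
    rl_score = pass_at_k(rl_probs, k)
    print(f"k={k:>3}  base={base_score:.3f}  rl={rl_score:.3f}")
```

Under these assumed numbers the RL-style distribution wins at k=1, while the base-style distribution wins at k=256, because a problem with zero success probability stays unsolved no matter how many samples are drawn.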
📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs, particularly in mathematics and programming tasks. It is widely believed that RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed the capacity of the corresponding base models. In this study, however, we critically re-examine this assumption by measuring the pass@$k$ metric with large values of $k$ to explore the reasoning capability boundary of the models across a wide range of model families and benchmarks. Surprisingly, RL does not, in fact, elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at smaller values of $k$ (e.g., $k=1$), base models can achieve a comparable or even higher pass@$k$ score than their RL counterparts at large values of $k$. The reasoning paths generated by RL-trained models are already included in the base models' sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already present in the base models. Further analysis shows that RL training boosts performance by biasing the model's output distribution toward paths that are more likely to yield rewards, so that correct responses are sampled more efficiently. However, this also results in a narrower reasoning capability boundary than that of the base models. Similar results are observed for models trained with RLVR on visual reasoning tasks. Moreover, we find that distillation, unlike RLVR, can genuinely introduce new knowledge into the model. These findings underscore a critical limitation of RLVR in advancing LLM reasoning abilities, which requires us to fundamentally rethink the impact of RL training on reasoning LLMs and the need for a better paradigm. Project Page: https://limit-of-RLVR.github.io
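Since the analysis rests on pass@$k$, a short sketch of how the metric can be computed may help. The abstract does not say which estimator the authors use; the code below assumes the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for a problem with n samples of which c are correct. The function names and the input layout are hypothetical.

```python
from math import comb


def pass_at_k_estimate(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) counts the k-subsets containing no correct sample;
    # it is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k_estimate(n, c, k) for n, c in results) / len(results)
```

For example, `benchmark_pass_at_k([(256, 3), (256, 0)], k=64)` averages one problem solved in 3 of 256 samples with one that was never solved. A problem with c = 0 contributes 0 at every k, which is why the large-k score acts as a proxy for the set of problems a model can solve at all.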