New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR

📅 2026-02-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether reinforcement learning with verifiable rewards (RLVR) endows large language models with genuinely novel capabilities or merely amplifies latent traces of existing ones. To address this, the authors propose a probabilistic framework grounded in instance-level solvability. To isolate atomic reasoning steps, they train models exclusively on single-step operations and then evaluate performance on unseen multi-step tasks. Leveraging the Algebrarium framework, they combine probabilistic modeling with Pearson correlation analysis to show that multi-step task success strongly correlates with the joint probability of the constituent atomic steps (ρ ∈ [0.69, 0.96]). This finding suggests that RLVR enhances complex reasoning by amplifying pre-existing skills, which opens up previously inaccessible solution pathways. However, the work also reveals that global reward optimization can inadvertently degrade specific local competencies, indicating a potential trade-off between aggregate performance and fine-grained skill retention.

📝 Abstract
Whether Reinforcement Learning with Verifiable Rewards (RLVR) endows Large Language Models (LLMs) with new capabilities or merely elicits latent traces remains a central debate. In this work, we align with the former view, proposing a probabilistic framework where capability is defined by instance-level solvability. We hypothesize that the emergence of complex reasoning can be driven by sharpening atomic step probabilities, which enables models to overcome the exponential decay of success rates inherent in multi-step reasoning chains. Utilizing the Algebrarium framework, we train models exclusively on single-step operations and evaluate their performance on unseen multi-step tasks. Our empirical results confirm that: (1) RLVR incentivizes the exploration of previously inaccessible solution paths by amplifying the model's existing skills; (2) composite performance is strictly governed by the joint probability of atomic steps, evidenced by high Pearson correlation coefficients ($\rho \in [0.69, 0.96]$); and (3) RLVR, acting as a global optimizer, can cause specific skills to be sacrificed to maximize aggregate reward. Our work offers a novel explanation for emergent abilities in RLVR, suggesting that the iterative optimization of solvable problems enables models to develop the capabilities to tackle previously unsolvable scenarios.
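The abstract's core quantitative claim, that multi-step success is governed by the joint probability of atomic steps and therefore decays exponentially with chain length, can be sketched numerically. The probabilities and chain length below are hypothetical illustrations, not values from the paper:

```python
# Sketch of the joint-probability hypothesis (hypothetical numbers):
# if each of k atomic steps succeeds independently with probability p,
# the chance of completing the full k-step chain is p**k, which decays
# exponentially in k. Modestly "sharpening" each atomic step therefore
# produces a large gain on long chains.

def chain_success(p: float, k: int) -> float:
    """Joint probability that k independent atomic steps all succeed."""
    return p ** k

# Before RLVR: moderately reliable atomic steps.
before = chain_success(0.80, 10)   # 0.8**10 ≈ 0.107
# After RLVR sharpens each atomic step from 0.80 to 0.95:
after = chain_success(0.95, 10)    # 0.95**10 ≈ 0.599

print(f"10-step success at p=0.80: {before:.3f}")
print(f"10-step success at p=0.95: {after:.3f}")
```

Under this reading, a per-step improvement of 0.15 turns a mostly-failing 10-step task (~11% success) into a mostly-succeeding one (~60%), which is the mechanism the paper offers for apparent capability "emergence" without any genuinely new skill.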
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning with Verifiable Rewards
Large Language Models
Emergent Reasoning
Capability Emergence
Multi-step Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning with Verifiable Rewards
Emergent Reasoning
Probabilistic Framework
Atomic Step Optimization
Instance-level Solvability