🤖 AI Summary
This work addresses the significant yet poorly understood influence of task difficulty and computational budget on the sample efficiency of reinforcement learning (RL) for enhancing reasoning in large language models (LLMs). To unify the analysis across diverse budgeting regimes, the paper introduces a metric termed “relative budget,” defined as ξ = H/E[T]: the ratio of the generation horizon H to the expected number of tokens E[T] required to reach the first correct solution under a base policy. This metric delineates three learning regimes (deficient, balanced, and ample) and, through theoretical analysis within an online RL framework, reveals how ξ governs reward variance and the probability of generating informative trajectories, thereby yielding finite-sample learning guarantees. Empirical results demonstrate that both learning efficiency and reasoning performance peak when ξ ∈ [1.5, 2.0], offering principled guidance for budget allocation in RL-based LLM training.
📝 Abstract
Reinforcement learning (RL) is a dominant paradigm for improving the reasoning abilities of large language models, yet its effectiveness varies across tasks and compute budgets. We propose a \emph{relative-budget} theory explaining this variation through a single quantity called relative budget $\xi := H/\mathbb{E}[T]$, where $H$ is the generation horizon (token budget) and $T$ denotes the number of tokens until the first correct solution under a base policy. We show that $\xi$ determines sample efficiency by controlling reward variance and the likelihood of informative trajectories. Our analysis reveals three regimes: in the \emph{deficient} regime ($\xi \to 0$), informative trajectories are rare and the sample complexity explodes; in the \emph{balanced} regime ($\xi=\Theta(1)$), informative trajectories occur with non-negligible probability and RL is maximally sample-efficient; and in the \emph{ample} regime ($\xi \to \infty$), learning remains stable but marginal gains per iteration diminish. We further provide finite-sample guarantees for online RL that characterize learning progress across these regimes. Specifically, in a case study under idealized distributional assumptions, we show that the relative budget grows linearly over iterations. Our empirical results confirm these predictions in realistic settings, identifying a budget $\xi \in [1.5, 2.0]$ that maximizes learning efficiency and coincides with peak reasoning performance.
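The relative budget and its regime classification can be illustrated with a minimal sketch. The function and variable names below (`relative_budget`, `classify_regime`, the numeric cutoffs, and the sample rollout lengths) are illustrative assumptions, not from the paper; in particular, the regime boundaries are asymptotic in the theory ($\xi \to 0$ and $\xi \to \infty$), so the sharp thresholds here are purely for demonstration.

```python
def relative_budget(horizon, first_success_tokens):
    """Estimate xi = H / E[T], where E[T] is approximated by the empirical
    mean number of tokens until the first correct solution across rollouts."""
    mean_t = sum(first_success_tokens) / len(first_success_tokens)
    return horizon / mean_t

def classify_regime(xi, low=0.5, high=4.0):
    """Map xi to the paper's three regimes; `low` and `high` are
    illustrative cutoffs, since the regimes are defined asymptotically."""
    if xi < low:
        return "deficient"  # informative trajectories rare; sample complexity explodes
    if xi > high:
        return "ample"      # stable learning, but diminishing gains per iteration
    return "balanced"       # non-negligible informative trajectories; most sample-efficient

# Hypothetical example: horizon H = 4096 tokens, with rollouts first
# reaching a correct solution after the following token counts.
samples = [2100, 2600, 1900, 2400]
xi = relative_budget(4096, samples)  # 4096 / 2250 ≈ 1.82
print(xi, classify_regime(xi))
```

With these numbers, ξ ≈ 1.82 falls in the balanced regime, inside the empirically optimal range ξ ∈ [1.5, 2.0] reported in the abstract.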