How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

📅 2026-04-22
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study quantifies how much each additional recurrence is worth to a looped (depth-recurrent) language model, measured in equivalent unique parameters. Using an iso-depth sweep of 116 pretraining runs and a jointly fitted scaling law, it reports a recurrence-equivalence exponent φ = 0.46: looping a block adds effective capacity, but less than adding the same number of unique blocks would. For example, at recurrence count r = 4, a 410M looped model performs on par with a 580M non-looped model while paying the training cost of a 1B non-looped one. On downstream evaluation, the gap persists on parametric-knowledge tasks and closes on simple open-book tasks, while reasoning tasks are not resolvable at the compute budgets studied. The exponent φ turns the choice of r into a predictable validation-loss cost when designing looped architectures.

📝 Abstract
We measure how much one extra recurrence is worth to a looped (depth-recurrent) language model, in equivalent unique parameters. From an iso-depth sweep of 116 pretraining runs across recurrence counts $r \in \{1, 2, 4, 8\}$ spanning ${\sim}50\times$ in training compute, we fit a joint scaling law $L = E + A\,(N_\text{once} + r^{\varphi} N_\text{rec})^{-\alpha} + B\,D^{-\beta}$ and recover a new recurrence-equivalence exponent $\varphi = 0.46$ at $R^2 = 0.997$. Intuitively, $\varphi$ tells us whether looping a block $r$ times is equivalent in validation loss to $r$ unique blocks of a non-looped model (full equivalence, $\varphi{=}1$) or to a single block run repeatedly with no capacity gain ($\varphi{=}0$). Our $\varphi = 0.46$ sits in between, so each additional recurrence predictably increases validation loss at matched training compute. For example, at $r{=}4$ a 410M looped model performs on par with a 580M non-looped model, but pays the training cost of a 1B non-looped one. On a five-axis downstream evaluation, the gap persists on parametric-knowledge tasks and closes on simple open-book tasks, while reasoning tasks are not resolvable at our compute budgets. For any looped LM, our $\varphi$ converts the design choice of $r$ into a predictable validation-loss cost, and future training recipes and architectures can be compared by how much they raise $\varphi$ above $0.46$.
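
The effective-parameter term in the scaling law above can be explored with a minimal Python sketch. It rests on several assumptions not stated in the abstract: the function names are hypothetical, the split of the 410M model into once-only and looped parameters is inferred by solving two equations rather than taken from the paper, and the unrolled count $N_\text{once} + r\,N_\text{rec}$ is used only as a rough proxy for training cost. The fitted coefficients $E$, $A$, $B$, $\alpha$, $\beta$ are not given in the abstract, so the loss function takes them as arguments.

```python
PHI = 0.46  # recurrence-equivalence exponent reported in the abstract


def effective_params(n_once, n_rec, r, phi=PHI):
    """Effective unique parameters: N_once + r**phi * N_rec."""
    return n_once + (r ** phi) * n_rec


def unrolled_params(n_once, n_rec, r):
    """Parameters applied per forward pass (assumed proxy for training cost)."""
    return n_once + r * n_rec


def solve_split(n_unique, n_equiv, r, phi=PHI):
    """Infer (n_once, n_rec) from two constraints (an illustrative assumption):
    n_once + n_rec = n_unique, and n_once + r**phi * n_rec = n_equiv."""
    if r <= 1:
        raise ValueError("the split is only identifiable for r > 1")
    n_rec = (n_equiv - n_unique) / (r ** phi - 1.0)
    return n_unique - n_rec, n_rec


def predicted_loss(n_once, n_rec, r, d, *, E, A, B, alpha, beta, phi=PHI):
    """L = E + A * N_eff**(-alpha) + B * D**(-beta); coefficients are caller-supplied."""
    n_eff = effective_params(n_once, n_rec, r, phi)
    return E + A * n_eff ** (-alpha) + B * d ** (-beta)


# Consistency check on the abstract's example: 410M looped at r=4 vs. 580M non-looped.
n_once, n_rec = solve_split(410e6, 580e6, r=4)
print(f"inferred once-only params: {n_once / 1e6:.0f}M")
print(f"inferred looped params:    {n_rec / 1e6:.0f}M")
print(f"unrolled params at r=4:    {unrolled_params(n_once, n_rec, 4) / 1e6:.0f}M")
```

Under these assumptions the inferred split is roughly 219M once-only and 191M looped parameters, and the unrolled count comes out near 0.98B, which is in line with the abstract's statement that the 410M looped model pays the training cost of a 1B non-looped one. Setting `phi=1.0` or `phi=0.0` reproduces the two limiting cases described above (full equivalence and no capacity gain).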
Problem

Research questions and friction points this paper is trying to address.

recurrence
scaling laws
looped language models
model capacity
validation loss
Innovation

Methods, ideas, or system contributions that make the work stand out.

recurrence-equivalence exponent
looped language models
iso-depth scaling laws
depth-recurrent architectures
scaling laws