How Much Is One Recurrence Worth? Iso-Depth Scaling Laws for Looped Language Models

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study measures how much each additional recurrence is worth to a looped (depth-recurrent) language model, expressed in equivalent unique parameters. Using an iso-depth sweep of 116 pretraining runs and a joint scaling-law fit, it reports a recurrence-equivalence exponent φ = 0.46: recurrence adds capacity, but less than the same number of unique layers would. For example, a 410M looped model with recurrence count r = 4 performs on par with a 580M non-looped model while paying the training cost of a 1B non-looped one. On downstream evaluation, the gap closes on simple open-book tasks but persists on parametric-knowledge tasks; reasoning tasks are not resolvable at the compute budgets studied. The fitted φ gives a predictive, empirical basis for choosing r when designing looped architectures.

📝 Abstract

We measure how much one extra recurrence is worth to a looped (depth-recurrent) language model, in equivalent unique parameters. From an iso-depth sweep of 116 pretraining runs across recurrence counts $r \in \{1, 2, 4, 8\}$ spanning ${\sim}50\times$ in training compute, we fit a joint scaling law $L = E + A\,(N_\text{once} + r^{\varphi} N_\text{rec})^{-\alpha} + B\,D^{-\beta}$ and recover a new recurrence-equivalence exponent $\varphi = 0.46$ at $R^2 = 0.997$. Intuitively, $\varphi$ tells us whether looping a block $r$ times is equivalent in validation loss to $r$ unique blocks of a non-looped model (full equivalence, $\varphi{=}1$) or to a single block run repeatedly with no capacity gain ($\varphi{=}0$). Our $\varphi = 0.46$ sits in between, so each additional recurrence predictably increases validation loss at matched training compute. For example, at $r{=}4$ a 410M looped model performs on par with a 580M non-looped model, but pays the training cost of a 1B non-looped one. On a five-axis downstream evaluation, the gap persists on parametric-knowledge tasks and closes on simple open-book tasks, while reasoning tasks are not resolvable at our compute budgets. For any looped LM, our $\varphi$ converts the design choice of $r$ into a predictable validation-loss cost, and future training recipes and architectures can be compared by how much they raise $\varphi$ above $0.46$.
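
The capacity term of the joint law can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code. Only the functional form and $\varphi = 0.46$ come from the abstract. The function names, the split of the 410M model into once-run and looped parameters, and the rule that training cost scales with $N_\text{once} + r\,N_\text{rec}$ are all assumptions made here for illustration. The fitted values of $E$, $A$, $\alpha$, $B$, $\beta$ are not given above, so `predicted_loss` takes them as arguments.

```python
PHI = 0.46  # recurrence-equivalence exponent reported in the abstract


def effective_params(n_once: float, n_rec: float, r: int, phi: float = PHI) -> float:
    """Capacity term of the joint law: N_once + r**phi * N_rec."""
    if r < 1:
        raise ValueError("recurrence count r must be >= 1")
    return n_once + (r ** phi) * n_rec


def compute_equivalent_params(n_once: float, n_rec: float, r: int) -> float:
    """Assumption (not from the source): each pass through the looped block
    costs as much as a unique block, so training cost tracks N_once + r * N_rec."""
    return n_once + r * n_rec


def predicted_loss(n_once: float, n_rec: float, r: int, tokens: float,
                   E: float, A: float, alpha: float, B: float, beta: float,
                   phi: float = PHI) -> float:
    """L = E + A * (N_once + r**phi * N_rec)**(-alpha) + B * D**(-beta).
    Coefficients must be supplied by the caller; none are quoted in the abstract."""
    n_eff = effective_params(n_once, n_rec, r, phi)
    return E + A * n_eff ** (-alpha) + B * tokens ** (-beta)


if __name__ == "__main__":
    # Hypothetical split of a 410M looped model (illustrative values only).
    n_once, n_rec, r = 213e6, 197e6, 4
    print(f"unique parameters:     {n_once + n_rec:,.0f}")
    print(f"effective parameters:  {effective_params(n_once, n_rec, r):,.0f}")
    print(f"compute-equivalent:    {compute_equivalent_params(n_once, n_rec, r):,.0f}")
```

With this hypothetical split, the effective parameter count lands near 0.58B and the compute-equivalent count near 1B, which is in line with the 410M / 580M / 1B example in the abstract. Setting `phi=1.0` recovers full equivalence with $r$ unique blocks, and `phi=0.0` recovers no capacity gain from looping.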

Problem

Research questions and friction points this paper is trying to address.

recurrence

scaling laws

looped language models

model capacity

validation loss

Innovation

Methods, ideas, or system contributions that make the work stand out.

recurrence-equivalence exponent

looped language models

iso-depth scaling laws