🤖 AI Summary
This study measures how much each additional recurrence is worth to a recurrent (looped) language model, expressed in equivalent unique parameters. From an iso-depth sweep of 116 pretraining runs and a jointly fitted scaling law, it estimates a new recurrence-equivalence exponent φ = 0.46: looping a block adds capacity, but less than adding the same number of unique layers would. For example, at recurrence count r = 4 a 410M looped model performs on par with a 580M non-looped model while paying the training cost of a 1B non-looped one. On downstream tasks, the gap closes on simple open-book tasks and persists on parametric-knowledge tasks, while reasoning tasks could not be resolved at the compute budgets studied. The exponent gives an empirical, predictive basis for pricing the choice of r when designing looped architectures.
📝 Abstract
We measure how much one extra recurrence is worth to a looped (depth-recurrent) language model, in equivalent unique parameters. From an iso-depth sweep of 116 pretraining runs across recurrence counts $r \in \{1, 2, 4, 8\}$ spanning ${\sim}50\times$ in training compute, we fit a joint scaling law $L = E + A\,(N_\text{once} + r^{\varphi} N_\text{rec})^{-\alpha} + B\,D^{-\beta}$ and recover a new recurrence-equivalence exponent $\varphi = 0.46$ at $R^2 = 0.997$. Intuitively, $\varphi$ tells us whether looping a block $r$ times is equivalent in validation loss to $r$ unique blocks of a non-looped model (full equivalence, $\varphi{=}1$) or to a single block run repeatedly with no capacity gain ($\varphi{=}0$). Our $\varphi = 0.46$ sits in between, so each additional recurrence predictably increases validation loss at matched training compute. For example, at $r{=}4$ a 410M looped model performs on par with a 580M non-looped model, but pays the training cost of a 1B non-looped one. On a five-axis downstream evaluation, the gap persists on parametric-knowledge tasks and closes on simple open-book tasks, while reasoning tasks are not resolvable at our compute budgets. For any looped LM, our $\varphi$ converts the design choice of $r$ into a predictable validation-loss cost, and future training recipes and architectures can be compared by how much they raise $\varphi$ above $0.46$.
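The capacity term of the scaling law can be read as an effective parameter count, $N_\text{once} + r^{\varphi} N_\text{rec}$. The minimal Python sketch below evaluates that term and compares it with a rough training-cost proxy. Several things here are assumptions rather than details from the abstract: the function names, the proxy $N_\text{once} + r\,N_\text{rec}$ (which assumes compute scales with the parameters applied per forward pass), and the 220M/190M split of the 410M model, which is a hypothetical choice made so the totals land near the quoted 580M and 1B figures.

```python
PHI = 0.46  # recurrence-equivalence exponent reported in the abstract


def effective_params(n_once: float, n_rec: float, r: int, phi: float = PHI) -> float:
    """Capacity term of the scaling law: N_once + r**phi * N_rec."""
    return n_once + (r ** phi) * n_rec


def compute_equivalent_params(n_once: float, n_rec: float, r: int) -> float:
    """Assumed training-cost proxy: every loop re-applies the recurrent block,
    so the parameters applied per forward pass are N_once + r * N_rec."""
    return n_once + r * n_rec


# Hypothetical split of a 410M looped model (not reported in the abstract).
n_once, n_rec, r = 220e6, 190e6, 4

unique = n_once + n_rec                               # parameters actually stored
effective = effective_params(n_once, n_rec, r)        # capacity the model behaves like
cost = compute_equivalent_params(n_once, n_rec, r)    # what training it costs

print(f"unique:    {unique / 1e6:.0f}M")
print(f"effective: {effective / 1e6:.0f}M")
print(f"cost:      {cost / 1e6:.0f}M")
```

Under these assumptions the effective count comes out close to 580M and the cost proxy close to 1B, in line with the abstract's example. Setting `phi=1` makes the two functions agree (full equivalence), while `phi=0` reduces the effective count to the unique parameter count (no capacity gain from looping).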