Quantitative Bounds for Length Generalization in Transformers

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates length generalization in Transformer models—i.e., their ability to extrapolate from short training sequences to longer, unseen ones. We characterize the minimal training sequence length required for reliable length extrapolation and establish, for the first time, quantitative theoretical bounds on this threshold. Our approach integrates rigorous theoretical analysis with empirical validation across key settings: ℓ∞- and mean-error criteria, infinite- and finite-precision softmax, and single- and two-layer architectures. We identify "sufficient short-sequence simulability of internal model behavior" as the fundamental mechanism enabling successful extrapolation. The theory yields qualitative criteria for the requisite training length, which we validate comprehensively through experiments. This constitutes the first theoretically grounded, quantitative design guideline for length generalization in Transformers.

📝 Abstract
We study the problem of length generalization (LG) in transformers: the ability of a model trained on shorter sequences to maintain performance when evaluated on much longer, previously unseen inputs. Prior work by Huang et al. (2025) established that transformers eventually achieve length generalization once the training sequence length exceeds some finite threshold, but left open the question of how large it must be. In this work, we provide the first quantitative bounds on the required training length for length generalization to occur. Motivated by previous empirical and theoretical work, we analyze LG in several distinct problem settings: $\ell_\infty$ error control vs. average error control over an input distribution, infinite-precision softmax attention vs. finite-precision attention (which reduces to an argmax) in the transformer, and one- vs. two-layer transformers. In all scenarios, we prove that LG occurs when the internal behavior of the transformer on longer sequences can be "simulated" by its behavior on shorter sequences seen during training. Our bounds give qualitative estimates for the length of training data required for a transformer to generalize, and we verify these insights empirically. These results sharpen our theoretical understanding of the mechanisms underlying extrapolation in transformers, and formalize the intuition that richer training data is required for generalization on more complex tasks.
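The abstract's remark that finite-precision attention "reduces to an argmax" can be illustrated with a minimal sketch. This is an illustrative toy, not the paper's construction: the rounding scheme and bit width below are assumptions, chosen only to show how, on long sequences, rounded softmax weights underflow to zero everywhere except the highest-scoring position, leaving hard (argmax) attention.

```python
import numpy as np

def softmax_attention(scores):
    """Idealized, infinite-precision softmax attention weights."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def finite_precision_attention(scores, bits=8):
    """Softmax weights rounded to `bits` fractional bits.

    On long sequences the per-position weights shrink toward zero, so
    rounding wipes out all but the strongest position(s): soft attention
    degenerates into hard (argmax) attention.
    """
    w = softmax_attention(scores)
    step = 2.0 ** -bits
    w = np.round(w / step) * step
    if w.sum() == 0:  # every weight underflowed: keep only the argmax
        w = np.zeros_like(scores)
        w[np.argmax(scores)] = 1.0
        return w
    return w / w.sum()

# One strongly attended position among 500 distractors.
scores = np.array([4.0] + [0.0] * 500)
soft = softmax_attention(scores)                    # spread over all positions
hard = finite_precision_attention(scores, bits=8)   # one-hot at the argmax
```

At 8 bits of precision each distractor weight (about 0.0018) rounds to zero while the top weight survives, so the renormalized result is exactly one-hot; the infinite-precision weights remain spread across the sequence.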
Problem

Research questions and friction points this paper is trying to address.

Quantifying training length bounds for transformer length generalization
Analyzing generalization under different error control and attention mechanisms
Establishing simulation-based conditions for extrapolation to longer sequences
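The two error-control regimes named above can be made concrete with a small sketch (the model, target, and input distribution here are illustrative stand-ins, not from the paper): $\ell_\infty$ control bounds the worst-case error over all inputs, while average (mean) control bounds the error in expectation over an input distribution, so a model can satisfy the latter while failing the former.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x):
    """Stand-in for a trained model's scalar output."""
    return np.sin(x)

def target(x):
    """Ground truth: agrees with the model except on rare inputs x > 2.5."""
    return np.sin(x) + 0.05 * (x > 2.5)

# Sample an input distribution (uniform on [0, 3], so x > 2.5 is rare).
xs = rng.uniform(0.0, 3.0, size=10_000)
errs = np.abs(model(xs) - target(xs))

linf_err = errs.max()    # worst-case (l_inf) error: hits the full 0.05
mean_err = errs.mean()   # average error: small, since the error is rare
```

Because the discrepancy occurs on only about one sixth of the inputs, the mean error is several times smaller than the worst-case error, which is exactly why the two criteria can demand different amounts of training data.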
Innovation

Methods, ideas, or system contributions that make the work stand out.

Quantitative bounds for length generalization threshold
Simulation of long sequence behavior via training
Analysis across precision and layer variations