🤖 AI Summary
Transformers exhibit severe length generalization failures on arithmetic tasks such as multi-operand addition (requiring generalization over both operand count and operand length) and multiplication (requiring generalization over the lengths of both operands). To address this, the paper introduces two main ingredients: (1) task-specific scratchpads that let the model attend to only a fixed number of tokens at each next-token prediction step, and (2) multi-level variants of Position Coupling that tell the model which positions to attend to. The approach achieves roughly 2-3x length generalization on both tasks, the first such result for arithmetic Transformers. On the theory side, the paper proves that a 1-layer Transformer using this method can solve multi-operand addition for operand lengths and operand counts up to exponential in the embedding dimension.
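The scratchpad idea can be illustrated with a simple running-sum format: the model adds operands one at a time and writes each partial sum, so every prediction step only needs a bounded window of recent tokens. The sketch below is a hypothetical format for building such training targets, not the paper's exact scratchpad layout:

```python
def addition_scratchpad(operands):
    """Build a scratchpad string for multi-operand addition.

    Each intermediate step records the running partial sum after
    incorporating one more operand. Illustrative format only; the
    paper's exact token layout may differ.
    """
    running = 0
    steps = []
    for op in operands:
        running += int(op)          # fold in the next operand
        steps.append(str(running))  # record the partial sum
    return " + ".join(operands) + " = " + " -> ".join(steps)


# e.g. three operands: each arrow step adds exactly one operand
print(addition_scratchpad(["12", "34", "5"]))
# 12 + 34 + 5 = 12 -> 46 -> 51
```

Because each step changes only the running sum, the number of tokens relevant to the next prediction stays fixed regardless of how many operands the problem has.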
📝 Abstract
Transformers often struggle with length generalization, meaning they fail to generalize to sequences longer than those encountered during training. While arithmetic tasks are commonly used to study length generalization, certain tasks are considered notoriously difficult, e.g., multi-operand addition (requiring generalization over both the number of operands and their lengths) and multiplication (requiring generalization over the lengths of both operands). In this work, we achieve approximately 2-3x length generalization on both tasks, which is the first such achievement in arithmetic Transformers. We design task-specific scratchpads enabling the model to focus on a fixed number of tokens at each next-token prediction step, and apply multi-level versions of Position Coupling (Cho et al., 2024; McLeish et al., 2024) to let Transformers know the right positions to attend to. On the theory side, we prove that a 1-layer Transformer using our method can solve multi-operand addition, up to operand lengths and operand counts that are exponential in the embedding dimension.
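A multi-level version of Position Coupling can be sketched as follows: each token receives one position ID per level, with one level coupling digits of equal significance across operands (and the answer) and another level indexing the operand itself. The helper below is an illustrative reconstruction under these assumptions, not the paper's actual implementation:

```python
def multilevel_positions(operands):
    """Assign two-level position IDs to a multi-operand addition prompt.

    Level 1 couples digits of the same significance across operands, so
    that e.g. all units digits share one ID; level 2 indexes the operand.
    Separator tokens get a level-1 ID of 0. Illustrative scheme only.
    """
    tokens, level1, level2 = [], [], []
    for op_idx, op in enumerate(operands, start=1):
        if op_idx > 1:
            tokens.append("+")
            level1.append(0)
            level2.append(op_idx)
        n = len(op)
        for j, digit in enumerate(op):
            tokens.append(digit)
            level1.append(n - j)   # least-significant digit gets ID 1
            level2.append(op_idx)
    return tokens, level1, level2


tokens, lvl1, lvl2 = multilevel_positions(["123", "45"])
# tokens: ['1','2','3','+','4','5']
# lvl1:   [ 3,  2,  1,  0,  2,  1]   <- '3' and '5' (units) share ID 1
# lvl2:   [ 1,  1,  1,  2,  2,  2]   <- which operand each token is in
```

Because coupled IDs are assigned by digit significance rather than absolute sequence position, the same attention pattern transfers to longer operands and larger operand counts than those seen in training.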