🤖 AI Summary
Transformers exhibit severe length generalization failures on arithmetic tasks such as multi-operand addition (requiring generalization over both operand count and operand length) and multiplication (requiring generalization over the lengths of both operands). To address this, the paper introduces two main ingredients: (1) task-specific scratchpads that let the model attend to only a fixed number of tokens at each next-token prediction step, and (2) multi-level variants of Position Coupling that tell the model which positions to attend to. The approach achieves roughly 2-3x length generalization on both tasks, the first such result for arithmetic Transformers. On the theory side, the paper proves that a 1-layer Transformer using this method can solve multi-operand addition for operand lengths and operand counts up to exponential in the embedding dimension.
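The scratchpad idea can be illustrated with a simple running-sum format: the model adds operands one at a time and writes each partial sum, so every prediction step only needs a bounded window of recent tokens. The sketch below is a hypothetical format for building such training targets, not the paper's exact scratchpad layout:

```python
def addition_scratchpad(operands):
    """Build a scratchpad string for multi-operand addition.

    Each intermediate step records the running partial sum after
    incorporating one more operand. Illustrative format only; the
    paper's exact token layout may differ.
    """
    running = 0
    steps = []
    for op in operands:
        running += int(op)          # fold in the next operand
        steps.append(str(running))  # record the partial sum
    return " + ".join(operands) + " = " + " -> ".join(steps)


# e.g. three operands: each arrow step adds exactly one operand
print(addition_scratchpad(["12", "34", "5"]))
# 12 + 34 + 5 = 12 -> 46 -> 51
```

Because each step changes only the running sum, the number of tokens relevant to the next prediction stays fixed regardless of how many operands the problem has.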
📝 Abstract
Transformers often struggle with length generalization, meaning they fail to generalize to sequences longer than those encountered during training. While arithmetic tasks are commonly used to study length generalization, certain tasks are considered notoriously difficult, e.g., multi-operand addition (requiring generalization over both the number of operands and their lengths) and multiplication (requiring generalization over the lengths of both operands). In this work, we achieve approximately 2-3x length generalization on both tasks, which is the first such achievement in arithmetic Transformers. We design task-specific scratchpads enabling the model to focus on a fixed number of tokens at each next-token prediction step, and apply multi-level versions of Position Coupling (Cho et al., 2024; McLeish et al., 2024) to let Transformers know the right positions to attend to. On the theory side, we prove that a 1-layer Transformer using our method can solve multi-operand addition, up to operand lengths and operand counts that are exponential in the embedding dimension.
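A multi-level version of Position Coupling can be sketched as follows: each token receives one position ID per level, with one level coupling digits of equal significance across operands (and the answer) and another level indexing the operand itself. The helper below is an illustrative reconstruction under these assumptions, not the paper's actual implementation:

```python
def multilevel_positions(operands):
    """Assign two-level position IDs to a multi-operand addition prompt.

    Level 1 couples digits of the same significance across operands, so
    that e.g. all units digits share one ID; level 2 indexes the operand.
    Separator tokens get a level-1 ID of 0. Illustrative scheme only.
    """
    tokens, level1, level2 = [], [], []
    for op_idx, op in enumerate(operands, start=1):
        if op_idx > 1:
            tokens.append("+")
            level1.append(0)
            level2.append(op_idx)
        n = len(op)
        for j, digit in enumerate(op):
            tokens.append(digit)
            level1.append(n - j)   # least-significant digit gets ID 1
            level2.append(op_idx)
    return tokens, level1, level2


tokens, lvl1, lvl2 = multilevel_positions(["123", "45"])
# tokens: ['1','2','3','+','4','5']
# lvl1:   [ 3,  2,  1,  0,  2,  1]   <- '3' and '5' (units) share ID 1
# lvl2:   [ 1,  1,  1,  2,  2,  2]   <- which operand each token is in
```

Because coupled IDs are assigned by digit significance rather than absolute sequence position, the same attention pattern transfers to longer operands and larger operand counts than those seen in training.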