Looped Transformers for Length Generalization

📅 2024-09-24

🏛️ arXiv.org

📈 Citations: 7

✨ Influential: 0

🤖 AI Summary

Transformers trained from scratch generalize poorly to sequence lengths not seen during training, particularly on arithmetic and algorithmic tasks such as addition and parity. This paper studies looped Transformers, which apply the same Transformer block repeatedly, and lets the number of loop steps adapt to each input. It targets tasks with a known iterative solution: several iterations of a RASP-L operation, a length-generalizable operation that a finite-sized Transformer can express. The authors propose a learning algorithm for training these models and report that looped Transformers with an adaptive number of steps significantly improve length generalization, learning highly length-generalizable solutions for various tasks.

📝 Abstract

Recent work has shown that Transformers trained from scratch can successfully solve various arithmetic and algorithmic tasks, such as adding numbers and computing parity. While these Transformers generalize well on unseen inputs of the same length, they struggle with length generalization, i.e., handling inputs of unseen lengths. In this work, we demonstrate that looped Transformers with an adaptive number of steps significantly improve length generalization. We focus on tasks with a known iterative solution, involving multiple iterations of a RASP-L operation - a length-generalizable operation that can be expressed by a finite-sized Transformer. We train looped Transformers using our proposed learning algorithm and observe that they learn highly length-generalizable solutions for various tasks.
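
To make "a known iterative solution" concrete, here is a minimal sketch of parity computed by applying one fixed update repeatedly, with the number of applications set by the input length. It is plain Python, not a Transformer: `step` and `looped_parity` are hypothetical names, and the hand-written `step` only stands in for the learned block. It is not claimed to be an actual RASP-L program or the paper's training setup.

```python
from typing import List, Tuple

State = Tuple[int, List[int]]  # (running parity, bits not yet consumed)

def step(state: State) -> State:
    """One fixed update whose definition does not depend on input length."""
    parity, remaining = state
    if not remaining:
        return state  # nothing left to consume; the state is a fixed point
    return parity ^ remaining[0], remaining[1:]

def looped_parity(bits: List[int]) -> int:
    """Apply the same step len(bits) times, so longer inputs get more steps."""
    state: State = (0, list(bits))
    for _ in range(len(bits)):
        state = step(state)
    return state[0]

print(looped_parity([1, 0, 1, 1]))  # 1
```

Because `step` never refers to the sequence length, the same update handles an input of any length; only the loop count changes. That is the property the abstract attributes to iterating a length-generalizable operation.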

Problem

Research questions and friction points this paper is trying to address.

Transformers trained from scratch struggle with length generalization, i.e., handling inputs of unseen lengths

Can looped Transformers with an adaptive number of steps improve length generalization?

Scope: tasks with a known iterative solution built from multiple iterations of a RASP-L operation

Innovation

Methods, ideas, or system contributions that make the work stand out.

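Looped Transformers with an adaptive number of steps significantly improve length generalization

The step count adapts to the input, which lets the model handle unseen input lengths

A proposed learning algorithm for tasks solved by iterating a RASP-L operation, a length-generalizable operation expressible by a finite-sized Transformer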
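
As an illustration of an adaptive step count, the sketch below adds two non-negative integers by repeating one digitwise update until no carry remains. The stopping rule (`while any(c)`) and the names `to_digits`, `add_step`, and `looped_add` are assumptions made for this example; the page does not describe how the paper chooses the number of steps, and this is plain Python rather than a trained model.

```python
from typing import List, Tuple

def to_digits(n: int, width: int) -> List[int]:
    """Decimal digits of n, least significant first, zero-padded to width."""
    return [(n // 10 ** i) % 10 for i in range(width)]

def add_step(s: List[int], c: List[int]) -> Tuple[List[int], List[int]]:
    """One length-independent update: add digitwise, then shift carries up."""
    total = [a + b for a, b in zip(s, c)]
    new_s = [t % 10 for t in total]
    new_c = [0] + [t // 10 for t in total[:-1]]
    return new_s, new_c

def looped_add(a: int, b: int) -> Tuple[int, int]:
    """Return (a + b, number of steps used) for non-negative a and b."""
    width = max(len(str(a)), len(str(b))) + 1  # one spare digit for overflow
    s, c = to_digits(a, width), to_digits(b, width)
    steps = 0
    while any(c):  # hypothetical stopping rule: stop once no carry remains
        s, c = add_step(s, c)
        steps += 1
    return sum(d * 10 ** i for i, d in enumerate(s)), steps

print(looped_add(995, 7))  # (1002, 4)
print(looped_add(120, 3))  # (123, 1)
```

The two calls use the same `add_step` but a different number of iterations, because the carry chain in `995 + 7` is longer. Inputs with more digits simply run the loop for more steps.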
