🤖 AI Summary
Transformers generalize poorly to unseen sequence lengths, particularly on arithmetic and algorithmic tasks, because they depend on training-length statistics. To address this, we propose the Recurrent Transformer (a looped Transformer), the first architecture to integrate iterative RASP-L-computable operations within a Transformer backbone. Our method introduces two key innovations: (1) an adaptive step-count control mechanism that dynamically determines the number of recurrent iterations per input, and (2) an iteration-aware supervised learning objective that explicitly supervises intermediate computational states. This design removes the explicit reliance on a fixed sequence length, so the learned algorithm is length-invariant by construction. Empirically, our model achieves 100% accuracy on length extrapolation for addition and parity tasks, outperforming standard Transformers and all prior baselines by large margins. These results demonstrate that finite-capacity Transformers can learn and generalize length-agnostic algorithmic logic when given appropriate inductive biases for iterative computation.
📝 Abstract
Recent work has shown that Transformers trained from scratch can successfully solve various arithmetic and algorithmic tasks, such as adding numbers and computing parity. While these Transformers generalize well to unseen inputs of the same length, they struggle with length generalization, i.e., handling inputs of unseen lengths. In this work, we demonstrate that looped Transformers with an adaptive number of steps significantly improve length generalization. We focus on tasks with a known iterative solution, involving multiple iterations of a RASP-L operation, i.e., a length-generalizable operation that can be expressed by a finite-sized Transformer. We train looped Transformers using our proposed learning algorithm and observe that they learn highly length-generalizable solutions for various tasks.
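To make the idea of "a known iterative solution" with an adaptive number of steps concrete, here is a minimal sketch for parity. It is not the paper's implementation: a plain Python function, `step`, stands in for the looped Transformer block, and the names `step`, `looped_parity`, and `State` are hypothetical. The sketch also assumes that the required number of iterations equals the input length.

```python
from typing import List, Tuple

# (running parity bit, index of the next unread input bit)
State = Tuple[int, int]


def step(bits: List[int], state: State) -> State:
    """One loop iteration: fold the next input bit into the running parity.

    Stand-in for a single application of the shared (looped) block.
    """
    parity, pos = state
    return parity ^ bits[pos], pos + 1


def looped_parity(bits: List[int]) -> int:
    """Apply the same `step` repeatedly; the loop count adapts to the input."""
    state: State = (0, 0)
    num_steps = len(bits)  # assumption: one iteration per input bit
    for _ in range(num_steps):
        state = step(bits, state)
    return state[0]


if __name__ == "__main__":
    print(looped_parity([1, 0, 1, 1]))
    print(looped_parity([1] * 50))
```

Because `step` never refers to the total input length, the same function handles inputs of any length; only the number of times it is applied changes. That is the property the looped architecture is meant to exploit.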