On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding

๐Ÿ“… 2024-10-02
๐Ÿ›๏ธ arXiv.org
๐Ÿ“ˆ Citations: 1
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work investigates the expressive power and structural bottlenecks of Looped Transformers in sequence-to-sequence function approximation. It identifies a theoretical limitation arising from their fixed-loop architecture and establishes an approximation-rate analysis, showing that expressive capacity increases with the loop count. To overcome this architectural ceiling, the authors propose a learnable loop-scaling mechanism conditioned on timestep encoding. The analysis is grounded in a modulus of continuity defined for sequence-to-sequence functions, and experiments across multiple reasoning benchmarks demonstrate consistent gains over strong baselines. Collectively, the work advances both the theoretical understanding and the practical design of recurrent-style, loop-based Transformer architectures.
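The summary's "modulus-of-continuity theory" refers to a standard tool from approximation theory. In its usual form for a function $f$ between normed spaces (the paper adapts this to sequence-to-sequence maps; the exact norm choice there may differ), it reads:

```latex
% Modulus of continuity of a function f over input sequences X, Y,
% measured in a norm \|\cdot\| (standard definition; the paper's
% sequence-to-sequence variant may use a specific norm):
\omega_f(\delta) \;=\; \sup_{\|X - Y\| \le \delta} \bigl\| f(X) - f(Y) \bigr\|,
\qquad \delta \ge 0.
```

Approximation rates are then typically stated as error bounds of the form $\omega_f(\delta)$ for a $\delta$ that shrinks as model capacity (here, the loop count) grows.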

๐Ÿ“ Abstract
Looped Transformers provide advantages in parameter efficiency, computational capabilities, and generalization for reasoning tasks. However, their expressive power regarding function approximation remains underexplored. In this paper, we establish the approximation rate of Looped Transformers by defining the modulus of continuity for sequence-to-sequence functions. This reveals a limitation specific to the looped architecture, which prompts the incorporation of scaling parameters for each loop, conditioned on timestep encoding. Experiments validate the theoretical results, showing that increasing the number of loops enhances performance, with further gains achieved through the timestep encoding.
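The mechanism described above, a shared block applied repeatedly with a learned, timestep-conditioned scale on each loop, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the class name `TimestepScaledLoop`, the sigmoid gate, and the scalar-per-loop parameterization are assumptions for the sketch.

```python
import torch
import torch.nn as nn

class TimestepScaledLoop(nn.Module):
    """Apply one shared Transformer block for n_loops iterations,
    scaling each iteration's update by a gate derived from a learned
    timestep encoding (hypothetical sketch of the paper's idea)."""

    def __init__(self, d_model: int, n_heads: int, n_loops: int):
        super().__init__()
        # A single block whose weights are shared across all loops.
        self.block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        # One learned encoding per loop timestep t = 0..n_loops-1.
        self.timestep_emb = nn.Embedding(n_loops, d_model)
        # Maps a timestep encoding to a scalar scaling parameter.
        self.to_scale = nn.Linear(d_model, 1)
        self.n_loops = n_loops

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for t in range(self.n_loops):
            # Timestep-conditioned scalar gate for loop t.
            gate = torch.sigmoid(self.to_scale(self.timestep_emb.weight[t]))
            # Scaled residual update: without the gate this reduces to
            # a plain fixed-loop (weight-tied) Transformer.
            x = x + gate * (self.block(x) - x)
        return x

model = TimestepScaledLoop(d_model=32, n_heads=4, n_loops=3)
out = model(torch.randn(2, 5, 32))  # (batch, seq_len, d_model)
```

Increasing `n_loops` deepens the effective computation without adding block parameters, matching the paper's finding that more loops improve performance, while the per-loop gates add the timestep-dependent flexibility the fixed-loop design lacks.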
Problem

Research questions and friction points this paper is trying to address.

Transformer
Sequence-to-Sequence
Performance Enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Looped Transformers
Loop-wise Scaling Parameters
Timestep Encoding
๐Ÿ”Ž Similar Papers
No similar papers found.
Kevin Xu
The University of Tokyo
Issei Sato
The University of Tokyo
Machine learning