🤖 AI Summary
This paper addresses the limited recurrent capacity of Transformers in modeling long sequences. It systematically investigates two augmentation paradigms: depth-wise recurrence (e.g., Universal Transformer) and chunk-wise temporal recurrence (e.g., Temporal Latent Bottleneck). The authors propose two key extensions: (1) a dynamic halting mechanism based on a global mean of activations, enabling an adaptive number of computation steps; and (2) an integration of depth-wise recurrence into the chunk-wise temporal framework, yielding a cross-paradigm fused architecture. Evaluation on diagnostic tasks including Long Range Arena (LRA), flip-flop language modeling, ListOps, and Logical Inference reveals significant differences in the effectiveness of the various recurrent inductive biases, and the results indicate that the dynamic halting mechanism improves both modeling accuracy and computational efficiency. The implementation is publicly available.
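The global mean-based dynamic halting idea can be illustrated with a minimal sketch: at each depth-wise recurrent step, the same layer is reapplied and a single halting probability is computed from the mean of the token activations, stopping the recursion once the accumulated probability crosses a threshold. This is an ACT-style illustration under assumed details (`shared_layer`, `w_halt`, the threshold value), not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_layer(h, W):
    # One depth-wise recurrent step: the same weights W are reused at every
    # step, as in Universal Transformer (attention omitted; a dense layer
    # stands in for the full block in this sketch).
    return np.tanh(h @ W)

def run_with_global_halting(h, W, w_halt, b_halt, threshold=0.99, max_steps=8):
    """ACT-style halting driven by one global score per step.

    Unlike per-token halting, the halt probability here comes from the mean
    activation over all tokens, so the whole sequence halts at the same
    step (hypothetical sketch of the global mean-based mechanism).
    """
    cum_halt = 0.0
    steps = 0
    for _ in range(max_steps):
        h = shared_layer(h, W)
        steps += 1
        global_mean = h.mean(axis=0)  # pool over tokens -> (d,)
        p_halt = 1.0 / (1.0 + np.exp(-(global_mean @ w_halt + b_halt)))
        cum_halt += p_halt
        if cum_halt >= threshold:
            break
    return h, steps

d = 16
h0 = rng.standard_normal((10, d))       # 10 tokens, d-dim hidden states
W = rng.standard_normal((d, d)) * 0.1
w_halt = rng.standard_normal(d)
out, steps = run_with_global_halting(h0, W, w_halt, b_halt=0.0)
```

Because a single scalar decides when to stop, the number of recurrent steps adapts to the input while every token shares the same depth, which is what makes the mechanism cheap relative to per-token halting.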
📝 Abstract
In this paper, we comprehensively study the inductive biases of two major approaches to augmenting Transformers with a recurrent mechanism: (1) the approach of incorporating a depth-wise recurrence similar to Universal Transformers; and (2) the approach of incorporating a chunk-wise temporal recurrence like Temporal Latent Bottleneck. Furthermore, we propose and investigate novel ways to extend and combine the above methods; for example, we propose a global mean-based dynamic halting mechanism for Universal Transformers and an augmentation of Temporal Latent Bottleneck with elements from Universal Transformers. We compare the models and probe their inductive biases on several diagnostic tasks, such as Long Range Arena (LRA), flip-flop language modeling, ListOps, and Logical Inference. The code is released at: https://github.com/JRC1995/InvestigatingRecurrentTransformers/tree/main
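The chunk-wise temporal recurrence can be sketched as follows: the sequence is split into chunks, a small set of latent states is carried across chunks, tokens in each chunk read from the latents, and the latents are then updated from the processed chunk. This is a minimal illustration in the spirit of Temporal Latent Bottleneck; the function names, single-head attention, and omission of projections and layer norms are simplifying assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def attend(q, k, v):
    # Scaled dot-product attention (single head, no masking, no projections).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def chunked_recurrence(x, latents, chunk_size=4):
    """Chunk-wise temporal recurrence sketch: per chunk, tokens cross-attend
    to the recurrent latent states, then the latents cross-attend to the
    updated chunk, carrying information forward to the next chunk."""
    outputs = []
    for start in range(0, len(x), chunk_size):
        chunk = x[start:start + chunk_size]
        # Perceptual step: chunk tokens read from the recurrent latents.
        chunk = chunk + attend(chunk, latents, latents)
        # Temporal step: latents summarize the chunk for later chunks.
        latents = latents + attend(latents, chunk, chunk)
        outputs.append(chunk)
    return np.concatenate(outputs, axis=0), latents

d = 8
x = rng.standard_normal((12, d))       # 12 tokens
latents = rng.standard_normal((2, d))  # 2 recurrent latent states
y, final_latents = chunked_recurrence(x, latents)
```

The depth-wise variant differs in reapplying one shared block along depth; combining the two, as proposed in the paper, amounts to running such a shared recurrent block inside each chunk while the latents still carry state across chunks.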