🤖 AI Summary
In a standard Transformer, the capacity of the residual stream is tied to model width, and therefore to compute cost and parameter count, which constrains scaling. This paper proposes the Residual Matrix Transformer (RMT), which replaces the conventional vector residual stream with an outer-product memory matrix, changing how layers store and retrieve information. Because the memory matrix can be sized independently, the RMT decouples residual-stream capacity from both FLOPs and parameter count. A theoretical analysis further shows that the RMT scales the residual stream more efficiently than the Transformer and has improved variance-propagation properties. Empirically, the RMT matches the Transformer's loss with 58% fewer FLOPs, 25% fewer parameters, and 41% fewer training tokens, and it outperforms the standard Transformer on downstream evaluations.
📝 Abstract
The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972; Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPs, 25% fewer parameters, and 41% fewer training tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the residual stream, as well as improved variance propagation properties. Code for this project can be found at https://github.com/bmac3/residual-matrix-transformer.
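To make the outer-product memory idea concrete, here is a minimal sketch of a Kohonen/Anderson-style associative memory: a layer "writes" a value vector into a matrix via an outer product with a key, and a later layer "reads" it back by projecting the matrix with that key. This is my own illustration of the classical mechanism the abstract cites, not the paper's exact RMT update rule; the dimensions and function names are assumptions for the example.

```python
import numpy as np

# Illustrative outer-product associative memory (Kohonen, 1972; Anderson, 1972).
# Note the memory holds d_key * d_val entries, so its capacity can be scaled
# independently of the key/value widths used by the layers reading and writing it.
d_key, d_val = 8, 16
rng = np.random.default_rng(0)

# Memory matrix standing in for the usual residual vector
M = np.zeros((d_key, d_val))

def write(M, key, value):
    # Store: add the outer product key @ value^T into the memory
    return M + np.outer(key, value)

def read(M, key):
    # Retrieve: project the memory with the (unit-norm) key
    return key @ M

k = rng.normal(size=d_key)
k /= np.linalg.norm(k)          # unit-norm key gives exact recall below
v = rng.normal(size=d_val)

M = write(M, k, v)
recovered = read(M, k)
assert np.allclose(recovered, v)  # single stored pair is recovered exactly
```

With a single stored pair and a unit-norm key, `read(write(M, k, v), k) = (k·k) v = v`, so recall is exact; with multiple stored pairs, retrieval degrades gracefully as keys interfere, which is the classical trade-off this memory family accepts in exchange for fixed-size storage.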