🤖 AI Summary
In a standard Transformer, the capacity of the residual stream is tied to model width, and therefore to compute cost and parameter count, which constrains scaling. This paper proposes the Residual Matrix Transformer (RMT), which replaces the conventional vector residual stream with an outer-product memory matrix, changing how layers store and retrieve information. Because the memory matrix can be sized independently, the RMT decouples residual-stream capacity from both FLOPs and parameter count. A theoretical analysis further shows that the RMT scales the residual stream more efficiently than the Transformer and has improved variance-propagation properties. Empirically, the RMT matches the Transformer's loss with 58% fewer FLOPs, 25% fewer parameters, and 41% fewer training tokens, and it outperforms the standard Transformer on downstream evaluations.
📝 Abstract
The residual stream acts as a memory bus where transformer layers both store and access features (Elhage et al., 2021). We consider changing the mechanism for retrieving and storing information in the residual stream, and replace the residual stream of the transformer with an outer product memory matrix (Kohonen, 1972; Anderson, 1972). We call this model the Residual Matrix Transformer (RMT). We find that the RMT enjoys a number of attractive properties: 1) the size of the residual stream can be scaled independently of compute and model size, improving performance, 2) the RMT can achieve the same loss as the transformer with 58% fewer FLOPs, 25% fewer parameters, and 41% fewer training tokens, and 3) the RMT outperforms the transformer on downstream evaluations. We theoretically analyze the transformer and the RMT, and show that the RMT allows for more efficient scaling of the residual stream, as well as improved variance propagation properties. Code for this project can be found at https://github.com/bmac3/residual-matrix-transformer.
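To make the outer-product memory idea concrete, here is a minimal sketch of a Kohonen/Anderson-style associative memory: a layer "writes" a value vector into a matrix via an outer product with a key, and a later layer "reads" it back by projecting the matrix with that key. This is my own illustration of the classical mechanism the abstract cites, not the paper's exact RMT update rule; the dimensions and function names are assumptions for the example.

```python
import numpy as np

# Illustrative outer-product associative memory (Kohonen, 1972; Anderson, 1972).
# Note the memory holds d_key * d_val entries, so its capacity can be scaled
# independently of the key/value widths used by the layers reading and writing it.
d_key, d_val = 8, 16
rng = np.random.default_rng(0)

# Memory matrix standing in for the usual residual vector
M = np.zeros((d_key, d_val))

def write(M, key, value):
    # Store: add the outer product key @ value^T into the memory
    return M + np.outer(key, value)

def read(M, key):
    # Retrieve: project the memory with the (unit-norm) key
    return key @ M

k = rng.normal(size=d_key)
k /= np.linalg.norm(k)          # unit-norm key gives exact recall below
v = rng.normal(size=d_val)

M = write(M, k, v)
recovered = read(M, k)
assert np.allclose(recovered, v)  # single stored pair is recovered exactly
```

With a single stored pair and a unit-norm key, `read(write(M, k, v), k) = (k·k) v = v`, so recall is exact; with multiple stored pairs, retrieval degrades gracefully as keys interfere, which is the classical trade-off this memory family accepts in exchange for fixed-size storage.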