🤖 AI Summary
This work reinterprets the residual update of a pre-norm Transformer layer as one step of a first-order optimizer on a surrogate token energy, with the attention and MLP sublayers acting as gradient oracles. From this view, the authors build a family of optimizer-inspired Transformers (triple-momentum, Adam/AdamW, Muon, SOAP) and compare them under matched compute. A controlled ablation and supporting theory indicate that momentum, not preconditioning, is the main source of the gain. In the main pretraining experiment, the triple-momentum TMMFormer achieves the lowest validation loss, outperforming the vanilla Transformer and prior architectural variants. Momentum-based designs, including TMMFormer, also reach flatter minima than the vanilla Transformer, which the authors link to less forgetting and better generalization.
📝 Abstract
The residual update of a pre-norm Transformer layer admits an interpretation as one step of a first-order optimizer acting on a surrogate token energy, wherein the attention and MLP sublayers function as gradient oracles. Based on this observation, we build a family of optimizer-inspired Transformers (triple-momentum, Adam/AdamW, Muon, SOAP) and compare them under matched compute. In our main pretraining experiment, the triple-momentum TMMFormer achieves the lowest validation loss, outperforming the vanilla Transformer and prior architectural variants. A controlled ablation and supporting theory show that momentum, not preconditioning, is the main source of the gain. We further show that TMMFormer and other momentum-based designs reach flatter minima than the vanilla Transformer, which leads to less forgetting and better generalization.
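As a rough illustration of the optimizer view described above, the sketch below treats a stack of residual layers as iterations of a first-order method on a toy energy. Everything in it is an assumption made for illustration: the quadratic energy, the function names, the step size, and the use of classical heavy-ball momentum as a simple stand-in. It is not the triple-momentum update used by TMMFormer, and the "gradient oracle" here is an exact gradient rather than an attention or MLP sublayer.

```python
# Toy sketch (hypothetical names throughout): residual layers as optimizer steps.
# Energy: E(x) = 0.5 * sum(c_i * x_i**2), an assumed stand-in for a token energy.


def energy(x, curvatures):
    """Value of the toy quadratic energy at x."""
    return 0.5 * sum(c * xi * xi for c, xi in zip(curvatures, x))


def energy_grad(x, curvatures):
    """Exact gradient of the toy energy; plays the role of a sublayer oracle."""
    return [c * xi for c, xi in zip(curvatures, x)]


def plain_residual_stack(x, curvatures, num_layers, step):
    """Vanilla residual update per layer: x <- x - step * grad(x)."""
    for _ in range(num_layers):
        g = energy_grad(x, curvatures)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def momentum_residual_stack(x, curvatures, num_layers, step, beta):
    """Heavy-ball variant: a velocity state v is carried across layers.

    v <- beta * v - step * grad(x);  x <- x + v
    With beta = 0 this reduces to the plain residual update.
    """
    v = [0.0] * len(x)
    for _ in range(num_layers):
        g = energy_grad(x, curvatures)
        v = [beta * vi - step * gi for vi, gi in zip(v, g)]
        x = [xi + vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    curvatures = [1.0, 0.1]   # one steep and one shallow direction
    x0 = [1.0, 1.0]
    plain = plain_residual_stack(x0, curvatures, num_layers=32, step=0.5)
    heavy = momentum_residual_stack(x0, curvatures, num_layers=32, step=0.5, beta=0.5)
    print("plain energy:   ", energy(plain, curvatures))
    print("momentum energy:", energy(heavy, curvatures))
```

On this toy problem, with the same number of layers and the same step size, the momentum stack ends at a lower energy than the plain stack because the velocity keeps making progress along the shallow direction. This only illustrates the general mechanism; it says nothing about the paper's measured results.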