🤖 AI Summary
To address the limited robustness of Transformers in chaotic time-series forecasting, this paper proposes Easy Attention, a lightweight attention mechanism that replaces the conventional query-key dot product and softmax with attention scores learned end-to-end as free parameters. Theoretically, the paper argues that queries, keys, and softmax are not essential to self-attention, and shows via singular-value decomposition (SVD) that standard self-attention fundamentally performs low-rank compression of the information carried by the queries and keys. Methodologically, the Transformer encoder is redesigned to jointly optimize temporal reconstruction and prediction. Experiments on the Lorenz system, a turbulent shear flow, and nuclear-reactor dynamics show that the approach surpasses both standard Transformers and LSTMs in prediction accuracy, while reducing computational complexity, mitigating training instability, and markedly improving generalization and robustness.
📝 Abstract
To improve the robustness of transformer neural networks used for temporal-dynamics prediction of chaotic systems, we propose a novel attention mechanism, called easy attention, which we demonstrate on time-series reconstruction and prediction. Although standard self-attention obtains its attention scores from the inner product of queries and keys followed by a softmax, we demonstrate that the queries, keys, and softmax are not necessary for capturing the long-term dependencies in temporal sequences. Through a singular-value decomposition (SVD) of the softmax attention scores, we further observe that self-attention compresses the contributions of the queries and keys into the space spanned by the attention scores. Our proposed easy-attention method therefore treats the attention scores directly as learnable parameters. This approach yields excellent results when reconstructing and predicting the temporal dynamics of chaotic systems, exhibiting greater robustness and lower complexity than self-attention or the widely used long short-term memory (LSTM) network. We show the improved performance of the easy-attention method on the Lorenz system, a turbulent shear flow, and a model of a nuclear reactor.
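As a rough illustration of the mechanism described above, the sketch below contrasts the standard score computation with one where the attention-score matrix is itself a (learnable) parameter, so no queries, keys, or softmax are needed. This is a minimal NumPy sketch under our own naming and shape assumptions (single sequence, single head, no training loop), not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def easy_attention(x, alpha, w_v):
    # x:     (seq_len, d_model) input sequence
    # alpha: (seq_len, seq_len) attention scores, treated as learnable
    #        parameters in place of softmax(Q K^T / sqrt(d))
    # w_v:   (d_model, d_model) value projection, as in standard attention
    return alpha @ (x @ w_v)

seq_len, d_model = 16, 8
x = rng.standard_normal((seq_len, d_model))
alpha = np.eye(seq_len)  # would be optimized during training; identity here
w_v = rng.standard_normal((d_model, d_model))

out = easy_attention(x, alpha, w_v)
print(out.shape)  # (16, 8)
```

Because `alpha` does not depend on the input, the per-layer cost drops to two matrix multiplications, which is the source of the reduced complexity claimed above; in a real model `alpha` would be registered as a trainable parameter and updated by gradient descent alongside `w_v`.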