🤖 AI Summary
Standard softmax attention has O(L²) computational complexity in the sequence length L, which hinders scalability for long multivariate time series. To address this, the authors propose a linear attention mechanism based on entropy equality. It is motivated by the finding that the effectiveness of attention in spatio-temporal modeling may stem less from the non-linearity of softmax than from a moderate, well-balanced weight distribution. They show that, because entropy is strictly concave on the probability simplex, distributions with aligned probability rankings and similar entropy values are structurally similar, and they use this result to design an algorithm that approximates the entropy of dot-product-derived distributions, and from it the attention weights, in linear (O(L)) complexity. On four spatio-temporal datasets, the method achieves competitive or superior forecasting performance while substantially reducing memory usage and computation time.
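As a generic illustration of why moving away from a row-wise softmax can make linear complexity possible, the sketch below compares two evaluation orders of unnormalized dot-product attention. It is not the entropy-based estimator described in this summary; the helper names (`matmul`, `quadratic_order`, `linear_order`) and the toy matrices are assumptions made for this example only.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def quadratic_order(q, k, v):
    """(Q K^T) V: materializes an L x L score matrix, so cost grows as O(L^2 d)."""
    return matmul(matmul(q, transpose(k)), v)


def linear_order(q, k, v):
    """Q (K^T V): materializes only a d x d summary, so cost grows as O(L d^2)."""
    return matmul(q, matmul(transpose(k), v))


# Toy inputs (assumed for illustration): L = 3 positions, d = 2 features.
q = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
k = [[1.0, 1.0], [0.0, 1.0], [2.0, 0.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out_quadratic = quadratic_order(q, k, v)
out_linear = linear_order(q, k, v)
```

Both orders give the same result by associativity of matrix multiplication. A softmax applied to each row of Q Kᵀ blocks this reordering, which is why linear attention variants replace it with a different way of obtaining the weights.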
📝 Abstract
Attention mechanisms have been extensively employed in various applications, including time series modeling, owing to their capacity to capture intricate dependencies; however, their utility is often constrained by quadratic computational complexity, which impedes scalability for long sequences. In this work, we propose a novel linear attention mechanism designed to overcome these limitations. Our approach is grounded in a theoretical demonstration that entropy, as a strictly concave function on the probability simplex, implies that distributions with aligned probability rankings and similar entropy values exhibit structural resemblance. Building on this insight, we develop an efficient approximation algorithm that computes the entropy of dot-product-derived distributions with only linear complexity, enabling the implementation of a linear attention mechanism based on entropy equality. Through rigorous analysis, we reveal that the effectiveness of attention in spatio-temporal time series modeling may not primarily stem from the non-linearity of softmax but rather from the attainment of a moderate and well-balanced weight distribution. Extensive experiments on four spatio-temporal datasets validate our method, demonstrating competitive or superior forecasting performance while achieving substantial reductions in both memory usage and computational time.
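To make the quantities in the abstract concrete, the following minimal sketch computes the Shannon entropy of one dot-product-derived attention distribution and numerically checks strict concavity on a single pair of distributions. It is a direct reference computation that touches every key, not the paper's linear-complexity approximation; the 1/√d scaling, the function names, and the toy vectors are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(p):
    """Shannon entropy in nats; entries equal to zero contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def attention_row(query, keys):
    """Attention weights of one query over all keys (scaled dot product, assumed)."""
    d = len(query)
    scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax(scores)


# Toy query and keys (assumed for illustration).
query = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]

weights = attention_row(query, keys)
h_weights = entropy(weights)

# The uniform distribution attains the maximum entropy log(n) on the simplex.
uniform = [1.0 / len(keys)] * len(keys)
h_uniform = entropy(uniform)

# Strict concavity: for two distinct distributions, the entropy of their
# midpoint is strictly larger than the average of their entropies.
midpoint = [0.5 * a + 0.5 * b for a, b in zip(weights, uniform)]
h_midpoint = entropy(midpoint)
concavity_gap = h_midpoint - 0.5 * (h_weights + h_uniform)
```

Here `concavity_gap` is positive because `weights` and `uniform` differ. Computing `h_weights` this way costs O(L) per query and therefore O(L²) over all queries, which is the cost the abstract's approximation algorithm is designed to avoid.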