🤖 AI Summary
This work addresses the challenge of jointly modeling long-range dependencies and local patterns in time-series forecasting. We propose WAVE, a novel attention mechanism that—uniquely—integrates the full ARMA statistical model into an autoregressive Transformer decoder. By leveraging linear attention and indirect weight generation, WAVE explicitly incorporates moving-average (MA) components without increasing computational complexity, enabling joint modeling of temporal dynamics. The method preserves token-level autoregressive prediction while ensuring both theoretical interpretability and engineering efficiency. Evaluated on multiple standard benchmarks, WAVE consistently outperforms existing autoregressive attention models, achieving state-of-the-art (SOTA) performance with improved forecasting accuracy and robustness.
📝 Abstract
We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-Average (MA) components. It can adapt to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter count of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that WAVE attention with the ARMA structure consistently improves the performance of various AR attention mechanisms on TSF tasks, achieving state-of-the-art results.
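To make the abstract's idea concrete, here is a minimal NumPy sketch of the general shape of such a mechanism: an AR component realized as causal linear attention (running-sum form, O(T) per head), plus an MA(1)-style correction whose per-token gates are generated indirectly from the input rather than learned as explicit MA weights. All names (`linear_attention`, `arma_attention`, `Wg`), the sigmoid gate, and the residual definition are illustrative assumptions, not the paper's actual formulation; the paper's indirect MA weight generation and residual handling may differ.

```python
import numpy as np

def linear_attention(Q, K, V):
    """Causal linear attention (AR component): position t attends to
    positions <= t via running sums of k_t v_t^T, giving O(T) time."""
    T, d = Q.shape
    phi = lambda x: np.maximum(x, 0.0) + 1e-6  # positive feature map
    Qf, Kf = phi(Q), phi(K)
    S = np.zeros((d, V.shape[1]))  # running sum of outer(k_t, v_t)
    z = np.zeros(d)                # running sum of k_t (normalizer)
    out = np.zeros_like(V)
    for t in range(T):
        S += np.outer(Kf[t], V[t])
        z += Kf[t]
        out[t] = (Qf[t] @ S) / (Qf[t] @ z)
    return out

def arma_attention(X, Wq, Wk, Wv, Wg):
    """Hypothetical ARMA-style attention sketch (not the paper's exact method):
    AR term  = causal linear attention over the input tokens;
    MA term  = the previous step's residual, scaled by a gate that is
               generated *indirectly* from the input (sigmoid of X @ Wg),
               so no explicit MA weight matrix is added."""
    ar = linear_attention(X @ Wq, X @ Wk, X @ Wv)
    theta = 1.0 / (1.0 + np.exp(-(X @ Wg)))  # per-token MA gates in (0, 1)
    out = np.zeros_like(ar)
    resid = np.zeros(ar.shape[1])  # residual from the previous step
    for t in range(X.shape[0]):
        out[t] = ar[t] + theta[t] * resid  # AR output + gated MA correction
        # Simplification: residual vs. the current token (model dim == value dim).
        resid = X[t] - out[t]
    return out
```

Because the MA gates are produced from existing projections of the input, the parameter count and the linear-time recurrence of the underlying attention are preserved, which is the efficiency property the abstract emphasizes.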