Linear Transformers as VAR Models: Aligning Autoregressive Attention Mechanisms with Autoregressive Forecasting

📅 2025-02-11
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing deep Transformers exhibit structural misalignment in autoregressive time-series forecasting: while linear attention in a single layer is theoretically equivalent to dynamic vector autoregression (VAR), stacking multiple layers breaks alignment with the autoregressive (AR) objective—leading to weak generative modeling capability, poor interpretability, and limited generalization. Method: This paper establishes, for the first time, the theoretical correspondence between linear Transformers and dynamic VAR models, and introduces a structural rearrangement mechanism that enforces strict AR constraints across multi-layer attention. Building on this, we propose SAMoVAR—a fully interpretable model that explicitly integrates dynamic VAR weights with attention mechanisms. Contribution/Results: On multivariate time-series forecasting benchmarks, SAMoVAR surpasses state-of-the-art methods, simultaneously improving prediction accuracy, interpretability, and computational efficiency. Our results empirically validate that architectural–objective structural alignment is critical for capturing the intrinsic dynamics of time series and achieving robust generalization.

📝 Abstract
Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention sometimes outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying VAR structure embedded within linear attention and hindering their ability to capture the data generative processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic vector autoregressive (VAR) structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. We then propose Structural Aligned Mixture of VAR (SAMoVAR), a linear Transformer variant that integrates interpretable dynamic VAR weights for multivariate TSF. By aligning the Transformer architecture with autoregressive objectives, SAMoVAR delivers improved performance, interpretability, and computational efficiency compared to SOTA TSF models.
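The single-layer equivalence claimed in the abstract can be sketched numerically: causal linear attention (no softmax) computes o_t = Σ_{s≤t} (q_t·k_s) v_s, which is the same as a dynamic VAR with time-varying coefficient matrices A_{t,s} = (q_t·k_s)·Wv^T applied to the raw inputs. The sketch below uses random stand-in projection matrices (Wq, Wk, Wv are illustrative assumptions, not the paper's trained weights) and checks that the two formulations agree.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4  # sequence length, model dimension (illustrative sizes)

# Hypothetical projection matrices — random stand-ins for learned weights
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
X = rng.standard_normal((T, d))  # multivariate series, one row per time step

Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Causal linear attention (unnormalized): o_t = sum_{s<=t} (q_t . k_s) v_s
O_attn = np.zeros((T, d))
for t in range(T):
    for s in range(t + 1):
        O_attn[t] += (Q[t] @ K[s]) * V[s]

# Same computation rewritten as a dynamic VAR: o_t = sum_{s<=t} A[t,s] @ x_s,
# where A[t,s] = (q_t . k_s) * Wv.T is a time-varying coefficient matrix
O_var = np.zeros((T, d))
for t in range(T):
    for s in range(t + 1):
        A_ts = (Q[t] @ K[s]) * Wv.T  # dynamic VAR coefficient for lag t-s
        O_var[t] += A_ts @ X[s]

# The two views coincide term by term
assert np.allclose(O_attn, O_var)
```

This only demonstrates the one-layer correspondence; the paper's contribution is rearranging the multi-layer MLP/attention flow so the stacked model retains this VAR form.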
Problem

Research questions and friction points this paper is trying to address.

Aligns Transformer with autoregressive forecasting objectives
Improves interpretability and generalization in time series forecasting
Proposes SAMoVAR for efficient multivariate forecasting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear attention as VAR
Reconfigured multi-layer Transformer
SAMoVAR for TSF