The Power of Architecture: Deep Dive into Transformer Architectures for Long-Term Time Series Forecasting

📅 2025-07-17
🤖 AI Summary
The lack of systematic guidance for Transformer architecture selection in long-term time series forecasting (LTSF) hinders principled model design. Method: We propose a decoupled taxonomy that orthogonally decomposes Transformer architectures into four dimensions—attention mechanism, normalization strategy, aggregation scheme, and forecasting paradigm—thereby isolating architectural effects from time-series-specific components. Contribution/Results: Through large-scale ablation studies with controlled variables, we identify the optimal configuration: bidirectional joint attention, full-stride aggregation, and direct mapping forecasting—achieving statistically significant improvements over state-of-the-art models across multiple benchmarks. This work establishes the first interpretable, reproducible evaluation framework for Transformer architectures in LTSF, yielding clear, actionable design principles for time-series modeling.

📝 Abstract
Transformer-based models have recently become dominant in Long-term Time Series Forecasting (LTSF), yet the variations in their architecture, such as encoder-only, encoder-decoder, and decoder-only designs, raise a crucial question: What Transformer architecture works best for LTSF tasks? However, existing models are often tightly coupled with various time-series-specific designs, making it difficult to isolate the impact of the architecture itself. To address this, we propose a novel taxonomy that disentangles these designs, enabling clearer and more unified comparisons of Transformer architectures. Our taxonomy considers key aspects such as attention mechanisms, forecasting aggregations, forecasting paradigms, and normalization layers. Through extensive experiments, we uncover several key insights: bi-directional attention with joint-attention is most effective; more complete forecasting aggregation improves performance; and the direct-mapping paradigm outperforms autoregressive approaches. Furthermore, our combined model, utilizing optimal architectural choices, consistently outperforms several existing models, reinforcing the validity of our conclusions. We hope these findings offer valuable guidance for future research on Transformer architectural designs in LTSF. Our code is available at https://github.com/HALF111/TSF_architecture.
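The abstract's central comparison is between two forecasting paradigms: direct mapping (predict the whole horizon in one shot) and autoregressive forecasting (predict one step, feed it back, repeat). A minimal NumPy sketch of that distinction follows; the function names and shapes are illustrative only, not taken from the paper's released code.

```python
import numpy as np

# Hedged sketch of the two forecasting paradigms the abstract compares.
# All names and shapes here are illustrative assumptions.

def direct_mapping(history: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Direct mapping: a single projection from the whole lookback
    window to the entire forecast horizon, emitted in one step."""
    return history @ W                      # (lookback,) @ (lookback, horizon)

def autoregressive(history: np.ndarray, step_fn, horizon: int) -> np.ndarray:
    """Autoregressive: predict one step, append it to the context,
    repeat. Later steps condition on earlier predictions, so errors
    can compound over long horizons."""
    buf = list(history)
    for _ in range(horizon):
        buf.append(step_fn(np.asarray(buf)))
    return np.asarray(buf[len(history):])
```

The compounding-error behavior of the autoregressive loop is one intuition for why the paper finds direct mapping stronger on long horizons.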
Problem

Research questions and friction points this paper is trying to address.

Identifying best Transformer architecture for long-term time series forecasting
Disentangling time-series-specific designs to isolate architectural impact
Evaluating key aspects like attention mechanisms and forecasting paradigms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes taxonomy to disentangle Transformer architecture designs
Identifies bi-directional joint-attention as most effective mechanism
Combines optimal architectural choices for superior model performance
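The bi-directional vs. causal attention distinction the taxonomy probes reduces to the attention mask: bi-directional attention lets every time step attend to every other step, while causal (decoder-style) attention hides future positions. A hedged sketch, with a hypothetical helper name not drawn from the authors' code:

```python
import numpy as np

# Illustrative sketch (not the authors' implementation): the visibility
# mask is what separates bi-directional from causal attention.

def attention_mask(seq_len: int, bidirectional: bool) -> np.ndarray:
    """Return a boolean mask where entry (i, j) is True iff position j
    is visible to position i."""
    if bidirectional:
        # Full visibility: every step attends to every other step.
        return np.ones((seq_len, seq_len), dtype=bool)
    # Causal: lower triangle only, so no step can peek at the future.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))
```

Under this view, the paper's finding favors the full mask for LTSF encoders, since forecasting from a fixed lookback window has no future tokens to leak.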
Lefei Shen
Zhejiang University
Time Series Forecasting, Deep Learning

Mouxiang Chen
Zhejiang University
debiasing, large language models, code generation, time series

Han Fu
PhD student at KTH
machine learning, deep learning, CI

Xiaoxue Ren
Zhejiang University
Software Engineering

Xiaoyun Joy Wang
State Street Technology (Zhejiang) Ltd.

Jianling Sun
Zhejiang University

Zhuo Li
State Street Technology (Zhejiang) Ltd.

Chenghao Liu
Salesforce Research Asia