Why Attention Fails: The Degeneration of Transformers into MLPs in Time Series Forecasting

πŸ“… 2025-09-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This paper identifies the fundamental cause of poor performance of Transformers in time-series forecasting: existing positional and value embedding schemes disrupt latent-space structural order, causing self-attention mechanisms to degenerate into identity mappings functionally equivalent to multilayer perceptrons (MLPs), thereby failing to capture temporal dependencies. Method: We construct an interpretable synthetic dataset, design modular ablation experiments, and conduct theoretical analysis to systematically characterize the degradation pathway of attention. Contribution/Results: Empirical results show that attention weights in mainstream time-series Transformer models are approximately uniform, and their forecasting accuracy exhibits no statistically significant difference from corresponding MLP baselines. We introduce the novel concept of *embedding structural disorder*β€”a principled explanation for attention failureβ€”and provide both theoretical grounding and empirical validation to guide the design of temporally aware Transformer architectures.
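The summary's central empirical claim is that attention weights in these models are approximately uniform. A minimal sketch of how one might quantify this (hypothetical code, not the paper's implementation) is to measure the entropy of each attention row relative to the maximum possible entropy, `log(seq_len)`; a ratio near 1.0 means the row is effectively uniform and the attention layer mixes all positions identically:

```python
# Hypothetical sketch: quantify attention-weight uniformity via normalized
# row entropy. A value near 1.0 indicates near-uniform (degenerate) attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def uniformity(attn_weights):
    """Mean row entropy divided by log(seq_len); 1.0 = perfectly uniform."""
    n = attn_weights.shape[-1]
    ent = -(attn_weights * np.log(attn_weights + 1e-12)).sum(axis=-1)
    return float(ent.mean() / np.log(n))

rng = np.random.default_rng(0)
seq_len, d = 16, 8
q = rng.normal(size=(seq_len, d))
k = rng.normal(size=(seq_len, d))

# Small logits -> nearly uniform rows: the degeneration the paper reports.
flat = uniformity(softmax(q @ k.T / 100.0))
# Sharp logits -> peaked rows: what a working attention mechanism produces.
peaked = uniformity(softmax(q @ k.T * 10.0))
print(flat, peaked)
```

In this toy setup `flat` lands close to 1.0 while `peaked` is far lower, which is the kind of gap a diagnostic like this would look for in a trained model's attention maps.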

πŸ“ Abstract
Transformer-based architectures have achieved high performance in natural language processing and computer vision, yet many studies have shown that they do not demonstrate a clear advantage in time series forecasting and even underperform simple linear baselines in some cases. However, most of these studies have not thoroughly explored the reasons behind the failure of Transformers. To better understand time-series Transformers (TSTs), we designed a series of experiments, progressively modifying Transformers into MLPs to investigate the impact of the attention mechanism. Surprisingly, Transformer blocks often degenerate into simple MLPs in existing time-series Transformers. We designed an interpretable dataset to investigate the reasons behind the failure of the attention mechanism and revealed that attention does not work in the expected way. We theoretically analyzed the causes of this phenomenon, demonstrating that current embedding methods fail to let Transformers operate in a well-structured latent space, and further analyzed the deeper underlying causes of the embedding failure.
Problem

Research questions and friction points this paper is trying to address.

Investigating why transformers underperform simple linear models in time series forecasting
Analyzing how attention mechanisms degenerate into MLPs in time series transformers
Exploring why embedding methods fail to create well-structured latent spaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressively modified transformers into MLPs
Designed interpretable dataset to analyze attention failure
Theoretically analyzed embedding methods causing latent space issues
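The ablation idea above can be illustrated with a small sketch (an assumed simplified block, not the paper's code): if the attention sub-layer collapses to an identity mapping, a Transformer block with residual connections reduces to a per-position MLP, so no information flows across time steps. One observable consequence is that permuting the input sequence simply permutes the output:

```python
# Illustrative sketch (assumed simplified architecture, layer norm omitted):
# with identity attention, a Transformer block acts as a per-position MLP.
import numpy as np

rng = np.random.default_rng(1)
d = 8
W1 = rng.normal(size=(d, 4 * d))
W2 = rng.normal(size=(4 * d, d))

def ffn(x):
    # Position-wise feed-forward network (ReLU MLP applied to each step).
    return np.maximum(x @ W1, 0.0) @ W2

def block(x, attn):
    h = x + attn(x)       # attention sub-layer with residual connection
    return h + ffn(h)     # feed-forward sub-layer with residual connection

def identity_attention(x):
    return x              # degenerate case: Attn(x) = x, no mixing across time

x = rng.normal(size=(16, d))
perm = rng.permutation(16)
out = block(x, identity_attention)
out_perm = block(x[perm], identity_attention)
# Permutation equivariance: each output depends only on its own input step,
# i.e. the block is functionally an MLP with no temporal dependency.
print(np.allclose(out[perm], out_perm))  # True
```

A block with genuinely working attention would break this equivariance, since each output would depend on the temporal context of the whole sequence.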