Why Attention Fails: The Degeneration of Transformers into MLPs in Time Series Forecasting

πŸ“… 2025-09-25
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This paper identifies the fundamental cause of poor performance of Transformers in time-series forecasting: existing positional and value embedding schemes disrupt latent-space structural order, causing self-attention mechanisms to degenerate into identity mappings functionally equivalent to multilayer perceptrons (MLPs), thereby failing to capture temporal dependencies. Method: We construct an interpretable synthetic dataset, design modular ablation experiments, and conduct theoretical analysis to systematically characterize the degradation pathway of attention. Contribution/Results: Empirical results show that attention weights in mainstream time-series Transformer models are approximately uniform, and their forecasting accuracy exhibits no statistically significant difference from corresponding MLP baselines. We introduce the novel concept of *embedding structural disorder*β€”a principled explanation for attention failureβ€”and provide both theoretical grounding and empirical validation to guide the design of temporally aware Transformer architectures.
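The summary's central empirical claim is that attention weights in these models are approximately uniform. A minimal sketch of how one might quantify this (hypothetical code, not the paper's implementation) is to measure the entropy of each attention row relative to the maximum possible entropy, `log(seq_len)`; a ratio near 1.0 means the row is effectively uniform and the attention layer mixes all positions identically:

```python
# Hypothetical sketch: quantify attention-weight uniformity via normalized
# row entropy. A value near 1.0 indicates near-uniform (degenerate) attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def uniformity(attn_weights):
    """Mean row entropy divided by log(seq_len); 1.0 = perfectly uniform."""
    n = attn_weights.shape[-1]
    ent = -(attn_weights * np.log(attn_weights + 1e-12)).sum(axis=-1)
    return float(ent.mean() / np.log(n))

rng = np.random.default_rng(0)
seq_len, d = 16, 8
q = rng.normal(size=(seq_len, d))
k = rng.normal(size=(seq_len, d))

# Small logits -> nearly uniform rows: the degeneration the paper reports.
flat = uniformity(softmax(q @ k.T / 100.0))
# Sharp logits -> peaked rows: what a working attention mechanism produces.
peaked = uniformity(softmax(q @ k.T * 10.0))
print(flat, peaked)
```

In this toy setup `flat` lands close to 1.0 while `peaked` is far lower, which is the kind of gap a diagnostic like this would look for in a trained model's attention maps.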

πŸ“ Abstract
Transformer-based architectures have achieved high performance in natural language processing and computer vision, yet many studies have shown that they do not demonstrate a clear advantage in time series forecasting and even underperform simple linear baselines in some cases. However, most of these studies have not thoroughly explored the reasons behind the failure of Transformers. To better understand time-series Transformers (TSTs), we designed a series of experiments, progressively modifying Transformers into MLPs to investigate the impact of the attention mechanism. Surprisingly, Transformer blocks often degenerate into simple MLPs in existing time-series Transformers. We designed an interpretable dataset to investigate the reasons behind the failure of the attention mechanism and revealed that attention does not work in the expected way. We theoretically analyzed the causes of this phenomenon, demonstrating that current embedding methods fail to let Transformers operate in a well-structured latent space, and further analyzed the deeper underlying causes of the embedding failure.
Problem

Research questions and friction points this paper is trying to address.

Investigating why transformers underperform simple linear models in time series forecasting
Analyzing how attention mechanisms degenerate into MLPs in time series transformers
Exploring why embedding methods fail to create well-structured latent spaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressively modified transformers into MLPs
Designed interpretable dataset to analyze attention failure
Theoretically analyzed embedding methods causing latent space issues
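The ablation idea above can be illustrated with a small sketch (an assumed simplified block, not the paper's code): if the attention sub-layer collapses to an identity mapping, a Transformer block with residual connections reduces to a per-position MLP, so no information flows across time steps. One observable consequence is that permuting the input sequence simply permutes the output:

```python
# Illustrative sketch (assumed simplified architecture, layer norm omitted):
# with identity attention, a Transformer block acts as a per-position MLP.
import numpy as np

rng = np.random.default_rng(1)
d = 8
W1 = rng.normal(size=(d, 4 * d))
W2 = rng.normal(size=(4 * d, d))

def ffn(x):
    # Position-wise feed-forward network (ReLU MLP applied to each step).
    return np.maximum(x @ W1, 0.0) @ W2

def block(x, attn):
    h = x + attn(x)       # attention sub-layer with residual connection
    return h + ffn(h)     # feed-forward sub-layer with residual connection

def identity_attention(x):
    return x              # degenerate case: Attn(x) = x, no mixing across time

x = rng.normal(size=(16, d))
perm = rng.permutation(16)
out = block(x, identity_attention)
out_perm = block(x[perm], identity_attention)
# Permutation equivariance: each output depends only on its own input step,
# i.e. the block is functionally an MLP with no temporal dependency.
print(np.allclose(out[perm], out_perm))  # True
```

A block with genuinely working attention would break this equivariance, since each output would depend on the temporal context of the whole sequence.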