Why Do Transformers Fail to Forecast Time Series In-Context?

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper investigates why Transformers often underperform linear models in time-series forecasting (TSF) and why their failure at in-context learning (ICL) on such tasks lacks theoretical grounding. Method: We conduct the first ICL-theoretic analysis of Transformer-based TSF, modeling linear self-attention and deriving rigorous asymptotic theory. We prove that, under AR(p) data, linear self-attention is fundamentally limited: it can at best asymptotically approximate the optimal linear predictor, and it induces mean collapse in chain-of-thought (CoT)-structured inference. Our validation combines analytical derivation with controlled CoT-structured experiments. Contribution/Results: We identify a theoretical root cause of Transformers' performance bottleneck in TSF, namely an intrinsic limitation of self-attention in modeling long-range temporal dependencies, even under idealized linear settings. This work establishes a verifiable theoretical foundation for diagnosing and improving Transformer architectures in time-series modeling.
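As a concrete point of reference, the classical linear baseline that the theory compares against can be sketched in a few lines of NumPy: fit an ordinary-least-squares one-step predictor on lagged values of a simulated AR(2) process. The coefficients and noise level here are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a stable AR(2) process: y_t = a1*y_{t-1} + a2*y_{t-2} + noise
a1, a2, T = 0.6, 0.3, 2000
y = np.zeros(T)
for t in range(2, T):
    y[t] = a1 * y[t - 1] + a2 * y[t - 2] + 0.1 * rng.standard_normal()

# Classical linear predictor: ordinary least squares on lag features
X = np.column_stack([y[1:-1], y[:-2]])  # lag-1 and lag-2 values
target = y[2:]
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
print(coef)  # approaches (a1, a2) as the context grows
```

The paper's first result says a linear self-attention model cannot achieve a lower expected MSE than this kind of in-context linear fit.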

📝 Abstract
Time series forecasting (TSF) remains a challenging and largely unsolved problem in machine learning, despite significant recent efforts leveraging Large Language Models (LLMs), which predominantly rely on Transformer architectures. Empirical evidence consistently shows that even powerful Transformers often fail to outperform much simpler models, e.g., linear models, on TSF tasks; however, a rigorous theoretical understanding of this phenomenon remains limited. In this paper, we provide a theoretical analysis of Transformers' limitations for TSF through the lens of In-Context Learning (ICL) theory. Specifically, under AR($p$) data, we establish that: (1) Linear Self-Attention (LSA) models $\textit{cannot}$ achieve lower expected MSE than classical linear models for in-context forecasting; (2) as the context length approaches infinity, LSA asymptotically recovers the optimal linear predictor; and (3) under Chain-of-Thought (CoT) style inference, predictions collapse to the mean exponentially fast. We empirically validate these findings through carefully designed experiments. Our theory not only sheds light on several previously underexplored phenomena but also offers practical insights for designing more effective forecasting architectures. We hope our work encourages the broader research community to revisit the fundamental theoretical limitations of TSF and to critically evaluate the direct application of increasingly sophisticated architectures without deeper scrutiny.
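The third result, mean collapse under CoT-style inference, has a simple intuition for stable AR processes: rolling a one-step linear predictor forward on its own outputs drives the forecast geometrically toward the process mean. A toy sketch of that mechanism, using the true AR(1) coefficient rather than a trained LSA model, so purely illustrative:

```python
# CoT-style rollout: each one-step prediction is fed back as the next
# input. For a stable AR(1) with coefficient phi and zero mean, the
# h-step forecast is phi**h * y_t, which decays geometrically to the
# mean -- the "mean collapse" phenomenon.
phi, y_t, horizon = 0.8, 1.0, 20
preds = []
state = y_t
for h in range(1, horizon + 1):
    state = phi * state  # deterministic rollout, noise set to 0
    preds.append(state)

print(preds[0], preds[-1])  # 0.8 vs ~0.0115: exponential decay toward the mean
```

The paper's claim is that LSA under CoT-structured inference exhibits this collapse; the snippet only illustrates the geometric-decay intuition behind it.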
Problem

Research questions and friction points this paper is trying to address.

Analyzing Transformers' failure in time series forecasting theoretically
Establishing limitations of Linear Self-Attention versus classical linear models
Investigating prediction collapse under Chain-of-Thought inference scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzed Transformers' TSF limitations through In-Context Learning theory
Proved Linear Self-Attention cannot achieve lower expected MSE than classical linear models
Showed CoT-style predictions collapse to the mean exponentially fast
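For intuition on why linear self-attention is compared against linear regression at all, one well-known construction from the ICL literature lets a single LSA layer implement one gradient-descent step (from zero) on the in-context least-squares objective. The sketch below uses that construction with hypothetical, idealized weights; it is not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative linear self-attention (LSA) forecast: with a particular
# weight choice, one LSA layer computes x_q @ (X.T @ y) / n over the
# context, i.e. one gradient-descent step (from zero) on the
# in-context least-squares loss.
n, p = 500, 2
X = rng.standard_normal((n, p))   # in-context lag vectors
w_true = np.array([0.6, 0.3])
y = X @ w_true + 0.05 * rng.standard_normal(n)
x_q = rng.standard_normal(p)      # query lag vector

lsa_pred = x_q @ (X.T @ y) / n    # LSA-as-one-GD-step prediction
ols_pred = x_q @ np.linalg.lstsq(X, y, rcond=None)[0]  # classical linear predictor
print(lsa_pred, ols_pred)
```

With isotropic context features the one-step estimate already lands near OLS; the paper's theory addresses when such attention constructions can at best match, never beat, the classical linear predictor.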