🤖 AI Summary
Problem: The theoretical foundations of single-layer Transformers as models of time-series data remain poorly understood, particularly their representational capacity and their inherent limitations in capturing dynamical processes.
Method: We formulate causal self-attention as a linear, history-dependent recurrence relation and analyze it through the lens of dynamical systems theory and delay embedding theory, conducting both linear and nonlinear case studies.
Contribution/Results: We show that the convexity constraint imposed by softmax attention restricts the class of linear dynamics that can be represented, which leads to oversmoothing in oscillatory systems. Conversely, we show that for partially observed nonlinear systems attention acts as an adaptive delay-embedding mechanism and enables effective state reconstruction, provided that sufficient temporal context and latent dimensionality are available. Our analysis clarifies the conditions under which single-layer Transformers succeed or fail in time-series modeling, which is relevant to general-purpose and zero-shot forecasting across diverse dynamical regimes. These findings connect empirical observations to classical dynamical systems theory.
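The convexity constraint can be illustrated with a minimal NumPy sketch. It is not the paper's experimental setup: it assumes a scalar signal, an identity value map, no residual connection or output projection, and random attention scores as a hypothetical stand-in for learned query-key products. Under those assumptions every causal softmax output is a convex combination of past samples, whereas the exact recurrence of an undamped oscillator, x_{t+1} = 2cos(ω)x_t − x_{t−1}, needs a negative coefficient and a coefficient larger than one.

```python
import numpy as np


def causal_softmax_weights(scores):
    """Row-wise softmax restricted to the causal (lower-triangular) part."""
    T = scores.shape[0]
    mask = np.tril(np.ones((T, T), dtype=bool))
    masked = np.where(mask, scores, -np.inf)
    # Subtract the row maximum for numerical stability; the diagonal entry
    # is always finite, so every row has a finite maximum.
    masked = masked - masked.max(axis=1, keepdims=True)
    w = np.exp(masked)
    return w / w.sum(axis=1, keepdims=True)


# Undamped linear oscillator x_t = cos(omega * t).
omega = 0.3
t = np.arange(200)
x = np.cos(omega * t)

# Arbitrary scores (assumption: stand-in for learned query-key products).
rng = np.random.default_rng(0)
scores = rng.normal(size=(t.size, t.size))
weights = causal_softmax_weights(scores)

# Attention output with an identity value map (assumption).
y = weights @ x

# Each output lies between the smallest and largest sample seen so far,
# because each row of `weights` is non-negative and sums to one.
running_min = np.minimum.accumulate(x)
running_max = np.maximum.accumulate(x)
inside_hull = bool(np.all((y >= running_min - 1e-12) & (y <= running_max + 1e-12)))

# Coefficients of the exact two-step recurrence: not a convex combination.
exact_coeffs = np.array([2.0 * np.cos(omega), -1.0])
recurrence_holds = bool(
    np.allclose(x[2:], exact_coeffs[0] * x[1:-1] + exact_coeffs[1] * x[:-2])
)

print("outputs inside running range of past samples:", inside_hull)
print("exact recurrence coefficients:", exact_coeffs)
```

Because the outputs are averages of past samples, they cannot leave the range of values already observed. This is one way to read the oversmoothing described above, although value and output projections in a full Transformer layer may change the picture.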
📝 Abstract
Transformers are increasingly adopted for modeling and forecasting time series, yet their internal mechanisms remain poorly understood from a dynamical systems perspective. In contrast to classical autoregressive and state-space models, which benefit from well-established theoretical foundations, Transformer architectures are typically treated as black boxes. This gap becomes particularly relevant as attention-based models are considered for general-purpose or zero-shot forecasting across diverse dynamical regimes. In this work, we do not propose a new forecasting model, but instead investigate the representational capabilities and limitations of single-layer Transformers when applied to dynamical data. Building on a dynamical systems perspective, we interpret causal self-attention as a linear, history-dependent recurrence and analyze how it processes temporal information. Through a series of linear and nonlinear case studies, we identify distinct operational regimes. For linear systems, we show that the convexity constraint imposed by softmax attention fundamentally restricts the class of dynamics that can be represented, leading to oversmoothing in oscillatory settings. For nonlinear systems under partial observability, attention instead acts as an adaptive delay-embedding mechanism, enabling effective state reconstruction when sufficient temporal context and latent dimensionality are available. These results help bridge empirical observations with classical dynamical systems theory, providing insight into when and why Transformers succeed or fail as models of dynamical systems.
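The role of delay embedding under partial observability can be sketched with a classical example that does not involve attention at all. The code below is an illustration under stated assumptions, not the paper's method: it uses the Hénon map as a hypothetical test system, observes only its first coordinate, and fits a least-squares predictor on polynomial features of the delay vector. The helper names `henon_x`, `delay_embed`, and `one_step_rmse` are invented for this sketch.

```python
import numpy as np


def henon_x(n_steps, a=1.4, b=0.3, x0=0.0, y0=0.0):
    """Simulate the Henon map and return only the observed x coordinate."""
    x, y = x0, y0
    out = np.empty(n_steps)
    for i in range(n_steps):
        out[i] = x
        x, y = 1.0 - a * x * x + y, b * x
    return out


def delay_embed(series, dim):
    """Rows are [s_t, s_{t-1}, ..., s_{t-dim+1}] for every valid t."""
    n = series.size - dim + 1
    return np.column_stack(
        [series[dim - 1 - k : dim - 1 - k + n] for k in range(dim)]
    )


def one_step_rmse(series, dim):
    """Fit s_{t+1} from a delay vector of length `dim` and return the RMSE."""
    emb = delay_embed(series, dim)[:-1]   # drop the last row: it has no target
    target = series[dim:]
    # Constant, linear, and squared terms of each delay coordinate
    # (an assumed feature set, chosen because the Henon map is quadratic).
    feats = np.column_stack([np.ones(len(emb)), emb, emb ** 2])
    coef, *_ = np.linalg.lstsq(feats, target, rcond=None)
    return float(np.sqrt(np.mean((feats @ coef - target) ** 2)))


series = henon_x(2000)
rmse_dim1 = one_step_rmse(series, dim=1)  # current observation only
rmse_dim2 = one_step_rmse(series, dim=2)  # current and previous observation

print("one-step RMSE with 1 delay coordinate: ", rmse_dim1)
print("one-step RMSE with 2 delay coordinates:", rmse_dim2)
```

With a single observation the hidden coordinate is missing and the fit leaves a sizeable residual. With two delay coordinates the observed series satisfies x_{t+1} = 1 − a·x_t² + b·x_{t−1} exactly, so the residual should fall to numerical precision. This mirrors, in a much simpler setting, the abstract's condition that state reconstruction requires sufficient temporal context.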