🤖 AI Summary
Transformer models for time-series classification suffer from opaque internal decision-making, hindering trust and interpretability. Method: This paper systematically applies mechanistic interpretability techniques to time-series Transformers, proposing a multi-scale analytical framework that integrates activation patching, attention saliency analysis, and sparse autoencoders, augmented by causal probing to construct causal graphs of internal information flow. The approach identifies critical attention heads and discriminative time steps, and uncovers latent feature representations driving classification. Contribution/Results: Experiments on benchmark time-series data demonstrate that the method disentangles the Transformer's functional architecture, reveals causal dependency paths in temporal modeling, and enhances both the interpretability and credibility of model decisions. This work contributes a paradigm for transparent and controllable time-series AI.
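The core causal tool named above, activation patching, can be sketched in a few lines of PyTorch. The toy model, layer choice, and synthetic series below are illustrative assumptions, not the paper's actual architecture: we cache an attention block's output on a clean input, then splice it into a run on a corrupted input and check how far the logits move back toward the clean run.

```python
# Minimal sketch of activation patching on a toy time-series transformer.
# The model, perturbation, and patched layer are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyTSTransformer(nn.Module):
    """One attention block for univariate series -> binary class logits."""
    def __init__(self, d_model=16, nhead=4):
        super().__init__()
        self.embed = nn.Linear(1, d_model)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 32), nn.ReLU(),
                                nn.Linear(32, d_model))
        self.head = nn.Linear(d_model, 2)

    def forward(self, x):                  # x: (batch, seq_len, 1)
        h = self.embed(x)
        a, _ = self.attn(h, h, h)          # module hook target
        h = h + a                          # residual around attention
        h = h + self.ff(h)                 # residual around feed-forward
        return self.head(h.mean(dim=1))    # pool over time -> (batch, 2)

model = TinyTSTransformer().eval()

clean = torch.randn(1, 32, 1)              # "clean" series
corrupt = clean.clone()
corrupt[:, 10:15] += 3.0                   # perturb a window of time steps

# 1) Cache the attention output on the clean run.
cache = {}
def save_hook(module, inp, out):
    cache["attn_out"] = out[0].detach()    # out = (attn_output, attn_weights)
handle = model.attn.register_forward_hook(save_hook)
clean_logits = model(clean)
handle.remove()

# 2) Re-run on the corrupted input, patching in the clean activation.
def patch_hook(module, inp, out):
    return (cache["attn_out"], out[1])     # replace attn output, keep weights
handle = model.attn.register_forward_hook(patch_hook)
patched_logits = model(corrupt)
handle.remove()

corrupt_logits = model(corrupt)

# If patching moves the logits back toward the clean run, the attention
# output at this site causally carries class-relevant information.
for name, lg in [("clean", clean_logits), ("corrupt", corrupt_logits),
                 ("patched", patched_logits)]:
    print(name, lg.squeeze().tolist())
```

Sweeping the patch over individual heads or time steps, rather than the whole block, yields the per-head and per-timestep causal attributions the summary describes.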
📝 Abstract
Transformer-based models have become state-of-the-art tools in various machine learning tasks, including time series classification, yet their complexity makes understanding their internal decision-making challenging. Existing explainability methods often focus on input-output attributions, leaving the internal mechanisms largely opaque. This paper addresses this gap by adapting mechanistic interpretability techniques from NLP (activation patching, attention saliency, and sparse autoencoders) to transformer architectures designed explicitly for time series classification. We systematically probe the internal causal roles of individual attention heads and time steps, revealing causal structures within these models. Through experimentation on a benchmark time series dataset, we construct causal graphs illustrating how information propagates internally, highlighting key attention heads and temporal positions driving correct classifications. Additionally, we demonstrate the potential of sparse autoencoders for uncovering interpretable latent features. Our findings provide both methodological contributions to transformer interpretability and novel insights into the functional mechanics underlying transformer performance in time series classification tasks.
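The sparse-autoencoder component mentioned in the abstract can be sketched as follows. This is a hedged, minimal sketch assuming activations have already been cached from a trained model; the dimensions, L1 weight, and the synthetic stand-in activations are illustrative, not the paper's actual settings.

```python
# Sketch of a sparse autoencoder (SAE) over cached transformer activations.
# The random "acts" tensor stands in for real cached activations; all
# hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

class SparseAutoencoder(nn.Module):
    """Overcomplete encoder-decoder trained with an L1 sparsity penalty."""
    def __init__(self, d_model=16, d_hidden=64):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        z = torch.relu(self.enc(x))        # nonnegative sparse codes
        return self.dec(z), z

# Stand-in for activations cached from a trained model: (n_samples, d_model)
acts = torch.randn(512, 16)

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_weight = 1e-3

for step in range(200):
    recon, z = sae(acts)
    # Reconstruction error plus L1 penalty pushing codes toward sparsity.
    loss = ((recon - acts) ** 2).mean() + l1_weight * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

recon, z = sae(acts)
sparsity = (z > 0).float().mean().item()   # fraction of active code units
print(f"final loss={loss.item():.4f}, active fraction={sparsity:.2f}")
```

After training, each hidden unit is a candidate interpretable feature: inspecting which input series (or which time steps) maximally activate a unit is how latent features are surfaced and labeled.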