Mechanistic Interpretability for Transformer-based Time Series Classification

📅 2025-11-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Transformer models for time-series classification suffer from opaque internal decision-making, hindering trust and interpretability. Method: The paper systematically applies mechanistic interpretability techniques to time-series Transformers, proposing a multi-scale analytical framework that integrates activation patching, attention saliency analysis, and sparse autoencoders, augmented by causal probing to construct causal graphs of internal information flow. The approach identifies critical attention heads and discriminative time steps, and uncovers latent feature representations driving classification. Contribution/Results: Experiments on a benchmark time-series dataset demonstrate that the method disentangles the Transformer's functional architecture, reveals causal dependency paths in temporal modeling, and improves the interpretability and credibility of model decisions, pointing toward more transparent and controllable time-series AI.
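
To make the activation-patching step concrete, below is a minimal PyTorch sketch of the general recipe as usually described in the mechanistic interpretability literature: cache an activation from a run on a clean series, then overwrite the corresponding activation during a run on a corrupted series and measure how much of the original prediction is restored. The toy model, the corruption scheme, and the choice of patching a whole attention layer (rather than individual heads, as the paper does) are illustrative assumptions, not the authors' implementation.

```python
# Minimal activation-patching sketch for a time-series Transformer classifier (PyTorch).
# Everything here (model, data, corruption, patched component) is an illustrative assumption.
import torch
import torch.nn as nn


class Block(nn.Module):
    """One Transformer encoder block with an explicitly accessible attention module."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(), nn.Linear(64, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        a, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + a)
        return self.norm2(x + self.ff(x))


class TinyTSTransformer(nn.Module):
    """Toy univariate time-series classifier: embed -> 2 blocks -> mean-pool -> linear head."""

    def __init__(self, d_model=32, n_heads=4, n_classes=2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)
        self.blocks = nn.ModuleList([Block(d_model, n_heads) for _ in range(2)])
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                      # x: (batch, time, 1)
        h = self.embed(x)
        for blk in self.blocks:
            h = blk(h)
        return self.head(h.mean(dim=1))        # (batch, n_classes)


torch.manual_seed(0)
model = TinyTSTransformer().eval()
clean_x = torch.randn(1, 100, 1)                        # "clean" series
corrupt_x = clean_x + 0.5 * torch.randn_like(clean_x)   # corrupted counterpart

target = model.blocks[0].attn                            # component whose output we patch
cache = {}

def save_hook(module, inputs, output):
    cache["clean"] = output[0].detach()                  # attention output: (batch, time, d_model)

def patch_hook(module, inputs, output):
    return (cache["clean"], output[1])                   # overwrite with the cached clean activation

with torch.no_grad():
    handle = target.register_forward_hook(save_hook)
    clean_logits = model(clean_x)
    handle.remove()

    corrupt_logits = model(corrupt_x)                    # baseline corrupted run

    handle = target.register_forward_hook(patch_hook)
    patched_logits = model(corrupt_x)                    # corrupted run with the clean activation patched in
    handle.remove()

print("clean  :", clean_logits)
print("corrupt:", corrupt_logits)
print("patched:", patched_logits)
```

A large shift of the patched logits back toward the clean prediction indicates that the patched component carries causally relevant information; repeating this over layers, heads, and timesteps yields the kind of causal importance map the summary describes.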

📝 Abstract
Transformer-based models have become state-of-the-art tools in various machine learning tasks, including time series classification, yet their complexity makes understanding their internal decision-making challenging. Existing explainability methods often focus on input-output attributions, leaving the internal mechanisms largely opaque. This paper addresses this gap by adapting various mechanistic interpretability techniques (activation patching, attention saliency, and sparse autoencoders) from NLP to transformer architectures designed explicitly for time series classification. We systematically probe the internal causal roles of individual attention heads and timesteps, revealing causal structures within these models. Through experimentation on a benchmark time series dataset, we construct causal graphs illustrating how information propagates internally, highlighting key attention heads and temporal positions driving correct classifications. Additionally, we demonstrate the potential of sparse autoencoders for uncovering interpretable latent features. Our findings provide both methodological contributions to transformer interpretability and novel insights into the functional mechanics underlying transformer performance in time series classification tasks.
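
Attention saliency can be illustrated in its simplest form by reading out the attention weights of a self-attention layer and aggregating how much attention each timestep receives. The sketch below shows only this raw-attention variant on an untrained toy layer; the paper's analysis may combine attention weights with gradients or other signals, and all names, shapes, and data here are assumptions rather than the authors' setup.

```python
# Attention-saliency sketch: inspect which timesteps receive the most attention
# in a single self-attention layer. Toy layer and random data are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads, T = 32, 4, 100
embed = nn.Linear(1, d_model)
attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(1, T, 1)                  # one univariate series of length T
h = embed(x)                              # (1, T, d_model)

# Per-head attention weights: (batch, n_heads, T, T); row i = how timestep i attends to all timesteps.
_, weights = attn(h, h, h, need_weights=True, average_attn_weights=False)

# Saliency per timestep = total attention each timestep *receives*, per head.
received = weights.sum(dim=2).squeeze(0)  # (n_heads, T)
for head in range(n_heads):
    top = received[head].topk(5).indices.tolist()
    print(f"head {head}: most-attended timesteps {sorted(top)}")
```

In a trained model, comparing these per-head profiles against known discriminative regions of the series is one straightforward way to connect attention behaviour to the classification decision.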
Problem

Research questions and friction points this paper is trying to address.

Understanding internal decision-making mechanisms in transformer-based time series classification models
Adapting mechanistic interpretability techniques from NLP to time series transformers
Probing the causal roles of attention heads and temporal positions in classification decisions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapting activation patching from NLP to time series
Applying attention saliency to reveal causal structures
Using sparse autoencoders to uncover interpretable latent features (see the sketch below)
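
A sparse autoencoder in this setting is typically trained on activations cached from a hidden layer, with an overcomplete bottleneck and an L1 penalty so that each latent unit fires for a narrow, hopefully interpretable pattern. The following is a minimal sketch under those assumptions; the dimensions, penalty weight, and random stand-in activations are illustrative, not the paper's configuration.

```python
# Sparse autoencoder (SAE) sketch: learn an overcomplete, L1-sparse dictionary over
# cached transformer activations. Dimensions, penalty, and training data are assumptions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=32, d_hidden=256):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, a):
        z = torch.relu(self.enc(a))        # sparse codes (candidate features)
        return self.dec(z), z

# Stand-in for activations cached from a hidden layer of the time-series Transformer:
# shape (num_samples * num_timesteps, d_model).
activations = torch.randn(10_000, 32)

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                            # sparsity strength (assumed)

for step in range(500):
    idx = torch.randint(0, activations.shape[0], (256,))
    batch = activations[idx]
    recon, z = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * z.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final loss:", loss.item())
```

Once trained, each latent unit can be characterized by the input windows or timesteps that activate it most strongly, which is how such features are usually inspected for interpretability.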
Matīss Kalnāre
Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands
Sofoklis Kitharidis
Leiden Institute of Advanced Computer Science (LIACS), Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands
Thomas Bäck
Professor of Computer Science, Leiden University; Chief Scientist, NORCE Research Centre, Norway
Evolutionary Computation, Evolutionary Algorithms, Machine Learning, Industry 4.0, Natural Computing
Niki van Stein
Leiden University
Explainable AI, Automated Algorithm Discovery, Deep Learning, Bayesian Optimization