Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

📅 2026-05-13
🤖 AI Summary
Although EEG foundation models demonstrate strong clinical performance, their black-box nature hinders trustworthy deployment. This work applies TopK sparse autoencoders to three EEG transformer architectures (SleepFM, REVE, and LaBraM) to extract sparse feature dictionaries from their embeddings, and evaluates the monosemanticity and entanglement of these features with respect to clinical concepts (abnormality, age, sex, and medication) through concept steering and spectral decoding. The authors propose an interpretability pipeline that applies across architectures and define a "target vs. off-target" probe area metric to quantify steering selectivity, exposing failure modes that include age-pathology confounding and collapse of global model performance. The experiments map latent-space interventions to physiologically interpretable spectral changes, such as suppression of pathological slow waves and restoration of alpha rhythms, identify three operational regimes (selectively steerable, encoded but entangled, and non-encoded), and show that a single hyperparameter procedure transfers robustly across all three models.
📝 Abstract
EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque, which is a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG transformers (SleepFM, REVE, and LaBraM) to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Via concept steering, we introduce a "target vs. off-target" probe area metric to quantify steering selectivity and reveal three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical representational failures: "wrecking-ball" interventions that collapse global model performance, and clinical entanglements, such as age-pathology confounding, where it is impossible to suppress one concept without corrupting the other. Finally, a spectral decoder maps these interventions back to the amplitude spectrum, translating latent manipulations into physiologically interpretable frequency signatures, such as pathological slow-wave suppression and $\alpha$-band restoration.
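To make the two core operations in the abstract concrete, the following is a minimal sketch of a TopK sparse autoencoder forward pass and a simple form of concept steering. It is not the authors' implementation: the weights are random rather than trained, the dimensions are toy values, and the helper names (`topk_encode`, `decode`, `steer`) as well as the choice to steer by rescaling a single latent feature are assumptions made for illustration only.

```python
import random

def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def topk_encode(x, w_enc, b_enc, b_dec, k):
    """Encode an embedding into a sparse code with at most k non-zero entries."""
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(w_enc, centered), b_enc)]
    # Keep only the k largest pre-activations; every other feature is zeroed.
    keep = set(sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k])
    return [max(p, 0.0) if i in keep else 0.0 for i, p in enumerate(pre)]

def decode(z, w_dec, b_dec):
    """Reconstruct the embedding as a sparse sum of dictionary columns."""
    return [sum(w * a for w, a in zip(row, z)) + b for row, b in zip(w_dec, b_dec)]

def steer(z, feature, scale):
    """Hypothetical steering rule: rescale one latent feature, leave the rest."""
    steered = list(z)
    steered[feature] *= scale
    return steered

# Toy setup with random (untrained) weights; all sizes are illustrative.
rng = random.Random(0)
d_model, n_features, k = 8, 32, 4
w_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
w_dec = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(d_model)]
b_enc = [0.0] * n_features
b_dec = [0.0] * d_model
x = [rng.gauss(0, 1) for _ in range(d_model)]  # stands in for a model embedding

z = topk_encode(x, w_enc, b_enc, b_dec, k)
target = max(range(n_features), key=lambda i: z[i])  # strongest active feature
z_steered = steer(z, target, 0.0)                    # scale 0.0 suppresses it

x_hat = decode(z, w_dec, b_dec)
x_steered = decode(z_steered, w_dec, b_dec)
# The change in the reconstruction is the suppressed feature's dictionary
# column scaled by its original activation.
shift = [a - b for a, b in zip(x_hat, x_steered)]

print("active features:", sum(1 for v in z if v != 0.0))
print("steered feature index:", target)
```

In this sketch, suppressing a feature moves the reconstructed embedding along exactly one dictionary direction. Measuring how much that shift changes a probe for the intended concept versus probes for other concepts is the kind of comparison a "target vs. off-target" selectivity metric would formalize; the exact definition used in the paper is not given in this listing.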
Problem

Research questions and friction points this paper is trying to address.

Mechanistic Interpretability
EEG Foundation Models
Sparse Autoencoders
Concept Entanglement
Clinical Trust
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoders
EEG Foundation Models
Mechanistic Interpretability
Concept Steering
Spectral Decoding
William Lehn-Schiøler
BrainCapture, Kongens Lyngby, Denmark; DTU Compute, Technical University of Denmark, Kongens Lyngby, Denmark; DTU Health Tech, Technical University of Denmark, Kongens Lyngby, Denmark
Magnus Ruud Kjær
DTU Health Tech, Technical University of Denmark, Kongens Lyngby, Denmark
Rahul Thapa
Graduate Student, Stanford University
Machine Learning, Healthcare AI, Data Science
Magnus Guldberg Pedersen
BrainCapture, Kongens Lyngby, Denmark; DTU Compute, Technical University of Denmark, Kongens Lyngby, Denmark
Anton Storgaard Mosquera
BrainCapture, Kongens Lyngby, Denmark; DTU Compute, Technical University of Denmark, Kongens Lyngby, Denmark
Nick Williams
Seer Medical, Melbourne, Australia
Radu Gatej
BrainCapture, Kongens Lyngby, Denmark
Tue Lehn-Schiøler
BrainCapture, Kongens Lyngby, Denmark
Sándor Beniczky
Filadelfia Epilepsy Hospital, Dianalund, Denmark; University Hospital of Copenhagen, Copenhagen, Denmark
Sadasivan Puthusserypady
Professor, Technical University of Denmark
Brain Computer Interface, EEG, Biomedical Signal Processing, AI Algorithms, Machine/Deep Learning
James Zou
Stanford University
Machine learning, computational biology, computational health, statistics, biotech
Lars Kai Hansen
Professor, Cognitive Systems, DTU Compute, Technical University of Denmark
Machine learning, AI, neuroimaging, cognitive systems, signal processing