🤖 AI Summary
Although EEG foundation models demonstrate strong clinical performance, their black-box nature hinders trustworthy deployment. This work applies TopK sparse autoencoders to three EEG Transformer architectures to extract sparse feature dictionaries from their embeddings, and evaluates the disentanglement and semantic specificity of these features with respect to clinical concepts such as pathology, age, sex, and medication through concept steering and spectral decoding. The resulting interpretability pipeline generalizes across architectures and defines a “target vs. off-target” probe area metric to quantify steering selectivity, uncovering failure modes including age–pathology confounding and global performance collapse. Experiments map latent-space interventions to physiologically interpretable spectral changes (such as suppression of pathological slow waves and restoration of alpha rhythms), identify three operational regimes (selectively steerable, encoded but entangled, and non-encoded), and show that a single hyperparameter procedure transfers robustly across all three models.
📝 Abstract
EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque, which is a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG transformers (SleepFM, REVE, and LaBraM) to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Using concept steering, we introduce a "target vs. off-target" probe area metric to quantify steering selectivity and reveal three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical representational failures: "wrecking-ball" interventions that collapse global model performance, and clinical entanglements, such as age-pathology confounding, where it is impossible to suppress one concept without corrupting the other. Finally, a spectral decoder maps these interventions back to the amplitude spectrum, translating latent manipulations into physiologically interpretable frequency signatures, such as pathological slow-wave suppression and $\alpha$-band restoration.
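For readers unfamiliar with the two building blocks named in the abstract, a TopK SAE forward pass and a single-feature steering edit, the following NumPy sketch may help. It is not the authors' implementation. The weight names (`W_enc`, `b_enc`, `W_dec`, `b_dec`), the subtraction of the decoder bias before encoding, the ReLU applied to the kept activations, and the multiplicative `scale` steering rule are all assumptions made for illustration.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode embeddings, keep the k largest pre-activations per row, decode.

    x: (batch, d_model) embeddings; W_enc: (d_model, n_features);
    W_dec: (n_features, d_model). All names are hypothetical.
    """
    pre = (x - b_dec) @ W_enc + b_enc  # (batch, n_features)
    # Column indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    codes = np.zeros_like(pre)
    # Assumed variant: ReLU on the kept values, so codes are non-negative.
    codes[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    recon = codes @ W_dec + b_dec
    return codes, recon


def steer(codes, W_dec, b_dec, feature, scale):
    """Rescale one dictionary feature and decode the edited sparse code."""
    edited = codes.copy()
    edited[:, feature] *= scale  # scale=0 suppresses, scale>1 amplifies
    return edited @ W_dec + b_dec


# Toy usage with random weights (no trained model involved).
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 8
x = rng.normal(size=(4, d_model))
W_enc = rng.normal(size=(d_model, n_features))
W_dec = rng.normal(size=(n_features, d_model))
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

codes, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
suppressed = steer(codes, W_dec, b_dec, feature=3, scale=0.0)
```

In this sketch, suppressing a feature removes exactly that feature's decoder direction from the reconstruction. Measuring how downstream probes for the target concept and for other concepts respond to such edits is the kind of comparison the "target vs. off-target" metric is described as quantifying.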