Sparse Autoencoders Do Not Find Canonical Units of Analysis

📅 2025-02-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work challenges the hypothesis that sparse autoencoders (SAEs) can discover a unique, complete, and atomic set of interpretable features ("canonical units") in large language models. To assess completeness, the authors introduce *SAE stitching*, which inserts or swaps latents from a larger SAE into a smaller one to expose information the smaller SAE misses. To evaluate atomicity, they propose *meta-SAEs*: SAEs trained on the decoder matrix of another SAE, which decompose its latents into finer-grained components. Empirical results show that SAEs are neither complete (larger SAEs contain novel latents that improve a smaller SAE's reconstruction) nor atomic: the latent for "Einstein", for example, decomposes into sublatents such as "scientist", "Germany", and "famous person", and these decompositions remain strongly interpretable. The authors argue that SAEs may still be useful tools even without canonical units, and they release an interactive dashboard for exploring meta-SAEs.

📝 Abstract
A common goal of mechanistic interpretability is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. Sparse autoencoders (SAEs) are a popular method for finding these features in LLMs, and it has been postulated that they can be used to find a *canonical* set of units: a unique and complete list of atomic features. We cast doubt on this belief using two novel techniques: SAE stitching to show they are incomplete, and meta-SAEs to show they are not atomic. SAE stitching involves inserting or swapping latents from a larger SAE into a smaller one. Latents from the larger SAE can be divided into two categories: *novel latents*, which improve performance when added to the smaller SAE, indicating that they capture novel information, and *reconstruction latents*, which can replace corresponding latents in the smaller SAE that have similar behavior. The existence of novel latents indicates the incompleteness of smaller SAEs. Using meta-SAEs (SAEs trained on the decoder matrix of another SAE), we find that latents in SAEs often decompose into combinations of latents from a smaller SAE, showing that larger SAE latents are not atomic. The resulting decompositions are often interpretable; e.g., a latent representing "Einstein" decomposes into "scientist", "Germany", and "famous person". Even if SAEs do not find canonical units of analysis, they may still be useful tools. We suggest that future research should either pursue different approaches for identifying such units, or pragmatically choose the SAE size suited to their task. We provide an interactive dashboard to explore meta-SAEs: https://metasaes.streamlit.app/
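The stitching idea in the abstract (a novel latent from a larger SAE improves a smaller SAE's reconstruction) can be illustrated with a toy sketch. Everything here is illustrative, not the paper's implementation: the "activations" are synthetic, the dictionary sizes are arbitrary, and a plain least-squares projection stands in for a trained SAE's sparse encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, k_small = 8, 4

# Toy "model activations": sparse non-negative mixtures of 6 ground-truth
# feature directions (sizes and sparsity chosen for illustration only).
true_dirs = rng.normal(size=(6, d_model))
codes = rng.random((256, 6)) * (rng.random((256, 6)) < 0.3)
acts = codes @ true_dirs

def reconstruct(acts, decoder):
    """Least-squares projection onto the decoder's feature directions.
    (A real SAE uses a learned encoder with a sparsity penalty; plain
    projection keeps the sketch short and deterministic.)"""
    latent = acts @ np.linalg.pinv(decoder)
    return latent @ decoder

mse = lambda a, b: float(((a - b) ** 2).mean())

small_decoder = true_dirs[:k_small]         # smaller SAE: misses 2 features
stitched_decoder = true_dirs[:k_small + 1]  # stitch in one "novel latent"

err_small = mse(acts, reconstruct(acts, small_decoder))
err_stitched = mse(acts, reconstruct(acts, stitched_decoder))
print(err_stitched < err_small)  # the novel latent improves reconstruction
```

Because the stitched decoder spans a strictly larger subspace that the data actually uses, reconstruction error drops, which is the abstract's operational test for a "novel latent".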
Problem

Research questions and friction points this paper is trying to address.

Sparse autoencoders fail to find canonical units
Novel techniques reveal incompleteness and non-atomicity
Meta-SAEs decompose latents into interpretable combinations
Innovation

Methods, ideas, or system contributions that make the work stand out.

SAE stitching shows incompleteness
Meta-SAEs demonstrate non-atomic latents
Interactive dashboard for meta-SAE exploration
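The meta-SAE contribution listed above can likewise be sketched in miniature. This is a hedged illustration, not the paper's method: the "atomic" directions and the weights on the "Einstein" decoder row are invented, and ordinary least squares over a known dictionary stands in for the trained meta-SAE encoder (which is itself a sparse autoencoder fit to the decoder matrix).

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 16

# Hypothetical "atomic" directions a smaller SAE might learn (names illustrative).
atoms = {n: rng.normal(size=d_model) for n in ["scientist", "Germany", "famous person"]}
atom_mat = np.stack(list(atoms.values()))  # (3, d_model) dictionary

# Model a larger SAE's "Einstein" decoder row as a sparse combination of atoms,
# mirroring the paper's example decomposition (coefficients are made up).
einstein = (0.6 * atoms["scientist"]
            + 0.3 * atoms["Germany"]
            + 0.5 * atoms["famous person"])

# Decompose the decoder row over the dictionary; a meta-SAE learns such
# decompositions from the whole decoder matrix rather than solving them directly.
coef, *_ = np.linalg.lstsq(atom_mat.T, einstein, rcond=None)
print(dict(zip(atoms, coef.round(2))))
```

Recovering the mixing weights from a decoder direction is the sense in which a latent like "Einstein" is "not atomic": it is expressible as an interpretable combination of finer-grained latents.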