Sanity Checks for Sparse Autoencoders: Do SAEs Beat Random Baselines?

📅 2026-02-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether sparse autoencoders (SAEs) genuinely recover semantically meaningful, interpretable features in neural networks, rather than merely achieving high reconstruction accuracy. The authors introduce a synthetic data setting with known ground-truth features to quantitatively measure SAEs' feature recovery rates for the first time, complemented by evaluations on real model activations. To rigorously assess performance, they propose three constrained baselines whose feature directions or activation patterns are set to random values. Results reveal that SAEs recover only about 9% of the true features and show no statistically significant advantage over the random baselines on key interpretability metrics: interpretability scores (0.90 for trained SAEs vs. 0.87 for random baselines), sparse probing (0.72 vs. 0.69), and causal editing (0.72 vs. 0.73). These findings challenge the presumed efficacy of SAEs in current interpretability applications.
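
To make the baseline idea concrete, here is a minimal sketch of one way to constrain feature directions to random values while still producing sparse activation patterns. This assumes a PyTorch setting; the function names and the simple top-k encoder are hypothetical illustrations, not the paper's exact construction of its three baselines.

```python
import torch

def random_direction_baseline(d_model: int, n_features: int, seed: int = 0) -> torch.Tensor:
    """Hypothetical baseline dictionary: random unit-norm feature directions
    standing in for a trained SAE decoder."""
    g = torch.Generator().manual_seed(seed)
    W_dec = torch.randn(n_features, d_model, generator=g)
    return W_dec / W_dec.norm(dim=-1, keepdim=True)

def encode_with_fixed_dictionary(acts: torch.Tensor, W_dec: torch.Tensor, k: int = 32) -> torch.Tensor:
    """Sparse codes for `acts` under the fixed random dictionary: project onto
    each direction and keep only the top-k (non-negative) activations per
    example, as a simple stand-in for an SAE's sparse encoder."""
    scores = acts @ W_dec.T                      # (batch, n_features)
    topk = torch.topk(scores, k, dim=-1)
    codes = torch.zeros_like(scores)
    codes.scatter_(-1, topk.indices, topk.values.clamp(min=0.0))
    return codes
```
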

📝 Abstract
Sparse Autoencoders (SAEs) have emerged as a promising tool for interpreting neural networks by decomposing their activations into sparse sets of human-interpretable features. Recent work has introduced multiple SAE variants and successfully scaled them to frontier models. Despite much excitement, a growing number of negative results in downstream tasks casts doubt on whether SAEs recover meaningful features. To directly investigate this, we perform two complementary evaluations. On a synthetic setup with known ground-truth features, we demonstrate that SAEs recover only 9% of true features despite achieving 71% explained variance, showing that they fail at their core task even when reconstruction is strong. To evaluate SAEs on real activations, we introduce three baselines that constrain SAE feature directions or their activation patterns to random values. Through extensive experiments across multiple SAE architectures, we show that our baselines match fully-trained SAEs in interpretability (0.87 vs 0.90), sparse probing (0.69 vs 0.72), and causal editing (0.73 vs 0.72). Together, these results suggest that SAEs in their current state do not reliably decompose models' internal mechanisms.
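
As a rough illustration of the synthetic evaluation, the sketch below shows one way feature-recovery and explained-variance numbers like those above could be computed, assuming learned and ground-truth feature directions are available as matrices. This is a hypothetical PyTorch snippet; the paper's exact recovery criterion may differ.

```python
import torch

def feature_recovery_rate(W_learned: torch.Tensor, W_true: torch.Tensor, thresh: float = 0.9) -> float:
    """Fraction of ground-truth features matched by some learned feature with
    cosine similarity above `thresh` (illustrative recovery criterion)."""
    a = W_learned / W_learned.norm(dim=-1, keepdim=True)
    b = W_true / W_true.norm(dim=-1, keepdim=True)
    cos = a @ b.T                                  # (n_learned, n_true)
    return (cos.max(dim=0).values > thresh).float().mean().item()

def explained_variance(x: torch.Tensor, x_hat: torch.Tensor) -> float:
    """1 - (residual variance / total variance) of the SAE reconstruction."""
    resid = (x - x_hat).pow(2).sum()
    total = (x - x.mean(dim=0)).pow(2).sum()
    return (1.0 - resid / total).item()
```
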
Problem

Research questions and friction points this paper is trying to address.

Sparse Autoencoders
Interpretability
Random Baselines
Feature Recovery
Neural Network Analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoders
Interpretability
Random Baselines
Feature Recovery
Mechanistic Interpretability