Are Sparse Autoencoder Benchmarks Reliable?

📅 2026-05-18

🤖 AI Summary

It remains unclear whether existing evaluation benchmarks for sparse autoencoders (SAEs) can reliably distinguish better SAEs from worse ones. This work audits the metrics in the widely used SAEBench suite along three dimensions: robustness to reseed noise on a fixed SAE, correlation with ground truth on synthetic SAEs, and discriminability across training trajectories. Two commonly used metrics, TPP and SCR, fail several of these tests at their canonical settings, and the remaining metrics show higher noise and lower discriminative power than the field assumes. The sae-probes variant of k-sparse probing is the most reliable metric tested, but it still struggles to separate variants of the same SAE architecture, which points to a need for more robust SAE benchmarks.

📝 Abstract

Sparse autoencoders (SAEs) are a core interpretability tool for large language models, and progress on SAE architectures depends on benchmarks that reliably distinguish better SAEs from worse ones. We audit the SAE quality metrics in SAEBench, the de facto standard SAE evaluation suite, through three complementary lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. We find that two of these metrics, Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR), fail under multiple lenses at their canonical settings and should not be used to evaluate SAEs. The other metrics show higher reseed noise and lower discriminability than the field assumes. The sae-probes variant of $k$-sparse probing is the most reliable metric we tested, but even sae-probes struggles to separate variants of the same SAE architecture. Our results show the field needs better SAE benchmarks.
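
To make the three lenses concrete, the sketch below shows one plausible way to quantify each of them for a single scalar metric. It is an illustration under stated assumptions, not the paper's procedure or the SAEBench API: the function names (`reseed_noise`, `discriminability`, `rank_correlation`), the pooled-standard-deviation gap, the Spearman rank correlation, and all numbers are hypothetical choices made for this example.

```python
import statistics


def reseed_noise(scores_by_seed):
    """Sample standard deviation of one metric across evaluation reseeds of a fixed SAE."""
    return statistics.stdev(scores_by_seed)


def discriminability(scores_a, scores_b):
    """Gap between two SAEs' mean scores, in units of their pooled reseed noise.

    Values well below 1 suggest the metric cannot separate the two SAEs.
    """
    pooled_var = (statistics.variance(scores_a) + statistics.variance(scores_b)) / 2
    return abs(statistics.mean(scores_a) - statistics.mean(scores_b)) / pooled_var ** 0.5


def _ranks(values):
    """Rank positions (0 = smallest); assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


def rank_correlation(metric_scores, ground_truth_scores):
    """Spearman rank correlation between a metric and a known ground-truth quality."""
    n = len(metric_scores)
    rank_m = _ranks(metric_scores)
    rank_g = _ranks(ground_truth_scores)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_m, rank_g))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Hypothetical scores: the same metric re-evaluated with four seeds
# on an early and a late checkpoint of one training run.
early_checkpoint = [0.70, 0.72, 0.68, 0.71]
late_checkpoint = [0.71, 0.73, 0.69, 0.72]

noise = reseed_noise(early_checkpoint)
gap = discriminability(early_checkpoint, late_checkpoint)

# Hypothetical synthetic setting: metric scores for four SAEs whose true
# quality ordering is known by construction.
rho = rank_correlation([0.1, 0.4, 0.3, 0.9], [1, 2, 3, 4])

print(f"reseed noise: {noise:.4f}, gap in noise units: {gap:.2f}, rank correlation: {rho:.2f}")
```

In this made-up example the gap between checkpoints is smaller than the reseed noise, which is the kind of situation in which a metric would fail to separate two SAEs even if one were genuinely better.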

Problem

Research questions and friction points this paper is trying to address.

Sparse Autoencoders

Benchmarks

Interpretability

Evaluation Metrics

Reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse autoencoders

interpretability

evaluation benchmarks