🤖 AI Summary
Current sparse autoencoder (SAE) evaluation relies heavily on unsupervised proxy metrics whose correlation with interpretability, feature disentanglement, and practical utility (e.g., model unlearning) remains unvalidated.
Method: We introduce SAEBench, a comprehensive benchmark for SAE interpretability spanning seven diverse evaluation dimensions. It comprises a multidimensional metric suite, an open-source collection of 200+ SAEs (covering eight architectures and training algorithms), an interactive analysis platform, and an extensible evaluation protocol.
Contributions/Results: We provide the first empirical demonstration that widely adopted proxy metrics are significantly misaligned with ground-truth interpretability performance. Matryoshka SAEs achieve over 40% relative improvement in feature disentanglement despite slightly underperforming on existing proxy metrics, and this advantage grows with SAE scale. Our framework establishes the first standardized infrastructure for cross-architecture and cross-training-method comparison, moving SAE evaluation from heuristic toward scientific practice and enabling reproducible, large-scale comparative studies across hundreds of open-source SAEs.
📝 Abstract
Sparse autoencoders (SAEs) are a popular technique for interpreting language model activations, and there is extensive recent work on improving SAE effectiveness. However, most prior work evaluates progress using unsupervised proxy metrics with unclear practical relevance. We introduce SAEBench, a comprehensive evaluation suite that measures SAE performance across seven diverse metrics, spanning interpretability, feature disentanglement and practical applications like unlearning. To enable systematic comparison, we open-source a suite of over 200 SAEs across eight recently proposed SAE architectures and training algorithms. Our evaluation reveals that gains on proxy metrics do not reliably translate to better practical performance. For instance, while Matryoshka SAEs slightly underperform on existing proxy metrics, they substantially outperform other architectures on feature disentanglement metrics; moreover, this advantage grows with SAE scale. By providing a standardized framework for measuring progress in SAE development, SAEBench enables researchers to study scaling trends and make nuanced comparisons between different SAE architectures and training methodologies. Our interactive interface enables researchers to flexibly visualize relationships between metrics across hundreds of open-source SAEs at: https://saebench.xyz
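To make the distinction concrete, the snippet below is a minimal sketch of a ReLU sparse autoencoder and two of the unsupervised proxy metrics the abstract refers to (L0 sparsity and reconstruction error). All dimensions, the random weights, and the untrained setup are illustrative assumptions for exposition, not SAEBench's actual implementation or metric suite:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: activation dim and dictionary (feature) dim.
d_model, d_sae = 16, 64
W_enc = rng.normal(0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode activations into (ideally sparse) features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU keeps features non-negative
    x_hat = f @ W_dec + b_dec
    return f, x_hat

# Stand-in "model activations" (a trained SAE would see real LM activations).
x = rng.normal(size=(32, d_model))
f, x_hat = sae_forward(x)

# Two common unsupervised proxy metrics:
l0 = (f > 0).mean()            # mean fraction of active features per input
mse = ((x - x_hat) ** 2).mean()  # reconstruction error
print(f"L0 sparsity proxy: {l0:.3f}, reconstruction MSE: {mse:.3f}")
```

The paper's point is that optimizing proxies like these does not guarantee gains on downstream measures such as feature disentanglement or unlearning, which is what SAEBench's supervised metrics test directly.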