🤖 AI Summary
This work addresses the limitations of current sparse autoencoder (SAE) evaluations: real-world LLM data is too noisy, and existing synthetic datasets are too small and unrealistic, to support fair architectural comparisons. To overcome this, we introduce SynthSAEBench, a toolkit that generates large-scale synthetic data with controllable feature correlations, hierarchical structure, and superposition, and we establish SynthSAEBench-16k, a standardized benchmark model built on this framework. For the first time in a synthetic setting, our approach reproduces key phenomena observed in LLM SAEs, such as the decoupling between reconstruction fidelity and latent quality and poor probing performance. It also uncovers a novel failure mode in Matching Pursuit SAEs, which overfit to superposition noise, thereby providing a fine-grained, interpretable, and verifiable foundation for evaluating and optimizing SAE architectures.
📝 Abstract
Improving Sparse Autoencoders (SAEs) requires benchmarks that can precisely validate architectural innovations. However, current SAE benchmarks on LLMs are often too noisy to differentiate architectural improvements, and current synthetic data experiments are too small-scale and unrealistic to provide meaningful comparisons. We introduce SynthSAEBench, a toolkit for generating large-scale synthetic data with realistic feature characteristics including correlation, hierarchy, and superposition, and a standardized benchmark model, SynthSAEBench-16k, enabling direct comparison of SAE architectures. Our benchmark reproduces several previously observed LLM SAE phenomena, including the disconnect between reconstruction and latent quality metrics, poor SAE probing results, and a precision-recall trade-off mediated by L0. We further use our benchmark to identify a new failure mode: Matching Pursuit SAEs exploit superposition noise to improve reconstruction without learning ground-truth features, suggesting that more expressive encoders can easily overfit. SynthSAEBench complements LLM benchmarks by providing ground-truth features and controlled ablations, enabling researchers to precisely diagnose SAE failure modes and validate architectural improvements before scaling to LLMs.
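The superposition setup the abstract relies on can be sketched minimally: sparse ground-truth feature activations are projected through an overcomplete set of directions into a lower-dimensional space, so feature directions interfere (the "superposition noise" that Matching Pursuit SAEs are said to exploit). This is a toy illustration under standard toy-model assumptions, not SynthSAEBench's actual API; all names and parameters below are hypothetical, and the real toolkit additionally models feature correlation and hierarchy.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, d_model = 64, 16   # more features than dimensions -> superposition
p_active = 0.05                # each feature fires sparsely

# Random unit-norm ground-truth feature directions (overcomplete basis).
W = rng.normal(size=(n_features, d_model))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def sample_activations(n_samples):
    """Sample sparse nonnegative feature coefficients, project to model space."""
    active = rng.random((n_samples, n_features)) < p_active
    coeffs = active * rng.random((n_samples, n_features))
    return coeffs @ W  # shape: (n_samples, d_model)

X = sample_activations(1000)
```

An SAE trained on `X` can then be scored against the known `W` and coefficients directly, which is the ground-truth evaluation that LLM activations cannot provide.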