🤖 AI Summary
Sparse autoencoders (SAEs) face a fundamental trade-off between dictionary size and high-level concept preservation: larger dictionaries capture finer-grained features, but sparsity pressure causes high-level concepts to be split or absorbed into more specific ones, leaving them missing or distorted. To address this, we propose the Matryoshka SAE, a multi-level nested architecture in which progressively smaller sub-dictionaries must each reconstruct the input independently, enforcing hierarchical feature learning from general to specific through nested reconstruction constraints and joint sparsity optimization across levels. Training on Gemma-2-2B and TinyStories demonstrates that Matryoshka SAEs significantly mitigate feature absorption, improve sparse probing and concept erasure performance, and yield more disentangled, interpretable features. Crucially, they preserve high-level abstraction capacity while enabling scalable dictionary expansion, easing the tension between expressivity and semantic coherence in sparse representation learning.
📝 Abstract
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting neural networks by extracting the concepts represented in their activations. However, choosing the size of the SAE dictionary (i.e., the number of learned concepts) creates a tension: as dictionary size increases to capture more relevant concepts, sparsity incentivizes features to be split or absorbed into more specific features, leaving high-level features missing or warped. We introduce Matryoshka SAEs, a novel variant that addresses these issues by simultaneously training multiple nested dictionaries of increasing size, forcing the smaller dictionaries to independently reconstruct the inputs without using the larger dictionaries. This organizes features hierarchically: the smaller dictionaries learn general concepts, while the larger dictionaries learn more specific concepts, without incentive to absorb the high-level features. We train Matryoshka SAEs on Gemma-2-2B and TinyStories and find superior performance on sparse probing and targeted concept erasure tasks, more disentangled concept representations, and reduced feature absorption. While there is a minor tradeoff with reconstruction performance, we believe Matryoshka SAEs are a superior alternative for practical tasks, as they enable training arbitrarily large SAEs while retaining interpretable features at different levels of abstraction.
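The nested-reconstruction idea can be sketched as follows. This is a minimal illustrative NumPy version, not the paper's training code: the sizes in `prefix_sizes`, the ReLU encoder, and the plain mean-squared-error objective are all assumptions chosen for brevity. The key point is that each nested prefix of the dictionary must reconstruct the input on its own, so the loss is a sum over prefixes.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 8          # activation dimension (illustrative)
dict_size = 32       # full dictionary size (illustrative)
prefix_sizes = [8, 16, 32]  # nested sub-dictionary sizes; the last is the full dictionary

# Shared encoder/decoder weights; each sub-dictionary is a prefix of the rows/columns.
W_enc = rng.normal(size=(d_model, dict_size)) / np.sqrt(d_model)
W_dec = rng.normal(size=(dict_size, d_model)) / np.sqrt(dict_size)


def matryoshka_loss(x: np.ndarray) -> float:
    """Sum of reconstruction losses, one per nested prefix of the dictionary.

    Each prefix must reconstruct x using only its own features, so the small
    dictionaries cannot rely on the larger ones -- the mechanism described
    in the abstract.
    """
    f = np.maximum(x @ W_enc, 0.0)  # ReLU feature activations (assumed encoder)
    total = 0.0
    for m in prefix_sizes:
        x_hat = f[:, :m] @ W_dec[:m]          # reconstruct from the first m features only
        total += float(np.mean((x - x_hat) ** 2))
    return total


x = rng.normal(size=(4, d_model))  # a small batch of fake activations
loss = matryoshka_loss(x)
```

In an actual implementation the gradient of this summed loss is what trains the shared weights, so early dictionary slots are pushed toward general features (they appear in every prefix's loss) while later slots specialize.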