🤖 AI Summary
Existing sparse autoencoders (SAEs) predominantly employ shallow architectures and implicitly rely on the quasi-orthogonality assumption, limiting their ability to extract strongly correlated features and thereby hindering the interpretability of neural representations. This work identifies and analyzes this limitation using MNIST as a benchmark. To address it, we propose the Matching Pursuit-based Multi-Iterative SAE (MP-SAE), the first SAE framework to explicitly integrate matching pursuit. MP-SAE performs residual-guided, hierarchical iterative optimization, eliminating dependence on quasi-orthogonality. Crucially, each atom selection step monotonically reduces the reconstruction error, enabling interpretable and provably convergent reconstruction of correlated features. Experiments demonstrate that MP-SAE significantly outperforms conventional SAEs in both reconstruction fidelity and feature disentanglement. By decoupling sparsity from orthogonality constraints, MP-SAE establishes a novel paradigm for robust, sparse modeling of neural representations.
📝 Abstract
Sparse autoencoders (SAEs) have recently become central tools for interpretability, leveraging dictionary learning principles to extract sparse, interpretable features from neural representations whose underlying structure is typically unknown. Evaluating SAEs in a controlled setting on MNIST reveals that current shallow architectures implicitly rely on a quasi-orthogonality assumption, which limits their ability to extract correlated features. To move beyond this, we introduce a multi-iteration SAE obtained by unrolling Matching Pursuit (MP-SAE), enabling residual-guided extraction of the correlated features that arise in hierarchical settings such as handwritten digit generation, while guaranteeing monotonic improvement of the reconstruction as more atoms are selected.
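The residual-guided iteration the abstract unrolls is classical Matching Pursuit: at each step, select the dictionary atom most correlated with the current residual, subtract its projection, and repeat. A minimal NumPy sketch below illustrates the monotonic error decrease the paper guarantees; the dictionary `D` stands in for an SAE decoder, and all names and shapes are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def matching_pursuit(x, D, n_iters=5):
    """Greedy Matching Pursuit.

    x: (d,) signal to reconstruct.
    D: (d, k) dictionary with unit-norm columns (may be correlated).
    Returns the sparse coefficients and the residual norm after each step.
    """
    residual = x.astype(float).copy()
    coeffs = np.zeros(D.shape[1])
    errors = []
    for _ in range(n_iters):
        corr = D.T @ residual                    # correlation of each atom with the residual
        j = int(np.argmax(np.abs(corr)))         # best-matching atom
        coeffs[j] += corr[j]                     # accumulate its coefficient
        residual -= corr[j] * D[:, j]            # subtract the projection onto atom j
        errors.append(float(np.linalg.norm(residual)))
    return coeffs, errors

# Demo: a random unit-norm dictionary whose atoms are not orthogonal.
rng = np.random.default_rng(0)
D = rng.standard_normal((16, 64))
D /= np.linalg.norm(D, axis=0)
z_true = rng.standard_normal(64) * (rng.random(64) < 0.1)  # sparse code
x = D @ z_true
coeffs, errors = matching_pursuit(x, D, n_iters=8)
```

Because each atom has unit norm, subtracting the projection shrinks the residual by the Pythagorean identity, so `errors` is non-increasing regardless of how correlated the atoms are; this is the property that lets MP-style selection drop the quasi-orthogonality assumption.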