Do Sparse Autoencoders Capture Concept Manifolds?

📅 2026-04-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study asks whether sparse autoencoders (SAEs) can capture concepts that neural network representations organize along low-dimensional manifolds rather than as isolated linear directions. Combining theoretical analysis with empirical evaluation, it identifies two distinct mechanisms by which an SAE can model such a manifold: global subspace modeling, in which a compact group of atoms spans the entire manifold, and local tiling, in which each feature covers only a restricted region. The authors find that current SAEs often blend the two strategies, producing a “diluted” representation in which the manifold is fragmented across many atoms and is rarely visible at the level of individual features. In response, the paper argues for treating geometric objects, not just individual directions, as the basic units of interpretability, and motivates post-hoc unsupervised methods that search for coherent groups of atoms to recover continuous conceptual structure.

📝 Abstract

Sparse autoencoders (SAEs) are widely used to extract interpretable features from neural network representations, often under the implicit assumption that concepts correspond to independent linear directions. However, a growing body of evidence suggests that many concepts are instead organized along low-dimensional manifolds encoding continuous geometric relationships. This raises three basic questions: what does it mean for an SAE to capture a manifold, when do existing SAE architectures do so, and how? We develop a theoretical framework that answers these questions and show that SAEs can capture manifolds in two fundamentally different ways: globally, by allocating a compact group of atoms whose linear span contains the entire manifold, or locally, by distributing it across features that each selectively tile a restricted region of the underlying geometry. Empirically, we find that SAEs suboptimally recover continuous structures, mixing the global subspace and local tiling solutions in a fragmented regime we call dilution. This explains why manifold structure is rarely visible at the level of individual concepts and motivates post-hoc unsupervised discovery methods that search for coherent groups of atoms rather than isolated directions. More broadly, our results suggest that future representation learning methods should treat geometric objects, not just individual directions, as the basic units of interpretability.
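The global and local solutions described in the abstract can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions, not the paper's framework or experiments: the manifold is assumed to be a unit circle in a 2-D plane, the dictionaries are written by hand rather than learned, coefficients in the global case are allowed to be signed for brevity, and all function names (`circle_points`, `global_code`, `local_code`, `mean_sq_error`) are hypothetical.

```python
import math

def circle_points(n):
    """n points on a unit circle: a toy 1-D concept manifold in a 2-D plane."""
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
            for i in range(n)]

def global_code(x):
    """Global solution (assumed form): two atoms whose linear span contains the circle."""
    atoms = [(1.0, 0.0), (0.0, 1.0)]
    coeffs = [x[0] * a[0] + x[1] * a[1] for a in atoms]  # signed coefficients
    return atoms, coeffs

def local_code(x, k):
    """Local solution (assumed form): k unit atoms spaced around the circle;
    only the atom best aligned with x is active, so each atom tiles one arc."""
    atoms = [(math.cos(2 * math.pi * j / k), math.sin(2 * math.pi * j / k))
             for j in range(k)]
    dots = [x[0] * a[0] + x[1] * a[1] for a in atoms]
    best = max(range(k), key=lambda j: dots[j])
    coeffs = [dots[j] if j == best else 0.0 for j in range(k)]
    return atoms, coeffs

def reconstruct(atoms, coeffs):
    """Linear decoder: weighted sum of atoms."""
    return (sum(c * a[0] for a, c in zip(atoms, coeffs)),
            sum(c * a[1] for a, c in zip(atoms, coeffs)))

def mean_sq_error(points, encode):
    """Average squared reconstruction error of an encoder over the points."""
    total = 0.0
    for x in points:
        atoms, coeffs = encode(x)
        r = reconstruct(atoms, coeffs)
        total += (x[0] - r[0]) ** 2 + (x[1] - r[1]) ** 2
    return total / len(points)

pts = circle_points(360)
global_err = mean_sq_error(pts, global_code)
local_err = mean_sq_error(pts, lambda x: local_code(x, 16))

print("global: 2 atoms, mean squared error =", global_err)
print("local: 16 atoms, 1 active per point, mean squared error =", local_err)
```

In this toy setting the global dictionary reconstructs the circle essentially exactly with two atoms that are both active on almost every point, while the local dictionary uses one active atom per point and trades a small reconstruction error for selectivity. A mixture of the two, with some atoms spanning and others tiling, is the kind of fragmented regime the abstract calls dilution.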

Problem

Research questions and friction points this paper is trying to address.

sparse autoencoders

concept manifolds

interpretable features

low-dimensional manifolds

representation learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

sparse autoencoders

concept manifolds

interpretability