🤖 AI Summary
Sparse autoencoders (SAEs) have underperformed on tasks that intervene on known concepts, raising questions about their practical utility. This paper argues that their real value lies elsewhere: in discovering *unknown* concepts. By explicitly distinguishing concept discovery from concept intervention, the authors cleanly separate the existing negative results (which concern intervention) from the positive ones (which concern discovery), reconciling previously contradictory empirical findings. The same distinction motivates concrete applications: SAEs as tools for unsupervised discovery of latent concepts in high-dimensional data, with uses in ML interpretability, explainability, fairness, auditing, and safety, as well as in detecting implicit patterns in the social and health sciences.
📝 Abstract
While sparse autoencoders (SAEs) have generated significant excitement, a series of negative results has added to skepticism about their usefulness. Here, we establish a conceptual distinction that reconciles competing narratives surrounding SAEs. We argue that while SAEs may be less effective for acting on known concepts, they are powerful tools for discovering unknown concepts. This distinction cleanly separates existing negative and positive results, and suggests several classes of SAE applications. Specifically, we outline use cases for SAEs in (i) ML interpretability, explainability, fairness, auditing, and safety, and (ii) the social and health sciences.
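For readers unfamiliar with the mechanism the abstract refers to, below is a minimal sketch of the standard SAE setup: an autoencoder trained on a model's internal activations with a sparsity penalty, so that individual latents tend to align with individual concepts. This is an illustrative implementation, not code from the paper; the dimensions (`d_model`, `n_latents`) and the coefficient `l1_coef` are placeholder choices.

```python
# Minimal sparse-autoencoder sketch (illustrative; not from the paper).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        # Encode activations into a wider, overcomplete latent space,
        # then decode back to the original activation space.
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps latent activations non-negative; combined with an
        # L1 penalty (below), most latents are zero on any given input,
        # so each active latent is a candidate "concept".
        z = torch.relu(self.encoder(x))
        x_hat = self.decoder(z)
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coef: float = 1e-3):
    # Reconstruction error plus a sparsity penalty on the latents.
    return ((x - x_hat) ** 2).mean() + l1_coef * z.abs().mean()

# Usage sketch: train on batches of model activations, then inspect
# which inputs most strongly activate each latent to surface
# previously unknown concepts (the discovery use case).
sae = SparseAutoencoder(d_model=768, n_latents=8192)
acts = torch.randn(64, 768)  # placeholder batch of model activations
x_hat, z = sae(acts)
loss = sae_loss(acts, x_hat, z)
loss.backward()
```

The paper's distinction maps onto this setup directly: reading the learned latents (discovery) has held up empirically, whereas editing them to steer the model (intervention on known concepts) is where the negative results have accumulated.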