On the Theoretical Understanding of Identifiable Sparse Autoencoders and Beyond

📅 2025-06-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Sparse autoencoders (SAEs) are widely employed to interpret internal features of large language models (LLMs), yet theoretical guarantees for uniquely and accurately recovering ground-truth monosemantic features from superimposed polysemantic representations remain lacking. Method: We establish, for the first time, necessary and sufficient conditions for SAE identifiability—rigorously characterizing the theoretical boundary under which monosemantic features are recoverable—and propose a theoretically grounded weighted reconstruction strategy that enforces feature disentanglement via a principled loss-function design. Results: Empirical evaluation confirms the practical validity of our identifiability condition: the weighted SAE achieves substantial gains over uniform-weight baselines in monosemanticity (+32.7%) and human interpretability (+28.4%), providing both foundational theoretical support and an actionable tool for trustworthy, interpretable AI.

Technology Category

Application Category

📝 Abstract
Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting features learned by large language models (LLMs). It aims to recover complex superposed polysemantic features into interpretable monosemantic ones through feature reconstruction via sparsely activated neural networks. Despite the wide applications of SAEs, it remains unclear under what conditions an SAE can fully recover the ground truth monosemantic features from the superposed polysemantic ones. In this paper, through theoretical analysis, we for the first time propose the necessary and sufficient conditions for identifiable SAEs (SAEs that learn unique and ground truth monosemantic features), including 1) extreme sparsity of the ground truth feature, 2) sparse activation of SAEs, and 3) enough hidden dimensions of SAEs. Moreover, when the identifiable conditions are not fully met, we propose a reweighting strategy to improve the identifiability. Specifically, following the theoretically suggested weight selection principle, we prove that the gap between the loss functions of SAE reconstruction and monosemantic feature reconstruction can be narrowed, so that the reweighted SAEs have better reconstruction of the ground truth monosemantic features than the uniformly weighted ones. In experiments, we validate our theoretical findings and show that our weighted SAE significantly improves feature monosemanticity and interpretability.
Problem

Research questions and friction points this paper is trying to address.

Identify conditions for SAEs to recover monosemantic features
Propose reweighting strategy to enhance feature identifiability
Validate theoretical findings with improved feature interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifiable SAEs need extreme sparsity conditions
Reweighting strategy narrows reconstruction loss gap
Weighted SAEs enhance feature interpretability significantly
🔎 Similar Papers
No similar papers found.
Jingyi Cui
Jingyi Cui
Peking University
self-supervised learningweakly supervised learningstatistical machine learning
Q
Qi Zhang
State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University
Y
Yifei Wang
CSAIL, MIT
Yisen Wang
Yisen Wang
Assistant Professor, Peking University
Machine LearningSelf-Supervised LearningLarge Language ModelsSafety