On the Theoretical Understanding of Identifiable Sparse Autoencoders and Beyond

📅 2025-06-19

📈 Citations: 0

✨ Influential: 0

career value

197K/year

🤖 AI Summary

Sparse autoencoders (SAEs) are widely employed to interpret internal features of large language models (LLMs), yet theoretical guarantees for uniquely and accurately recovering ground-truth monosemantic features from superimposed polysemantic representations remain lacking. Method: We establish, for the first time, necessary and sufficient conditions for SAE identifiability—rigorously characterizing the theoretical boundary under which monosemantic features are recoverable—and propose a theoretically grounded weighted reconstruction strategy that enforces feature disentanglement via a principled loss-function design. Results: Empirical evaluation confirms the practical validity of our identifiability condition: the weighted SAE achieves substantial gains over uniform-weight baselines in monosemanticity (+32.7%) and human interpretability (+28.4%), providing both foundational theoretical support and an actionable tool for trustworthy, interpretable AI.

Technology Category

Application Category

📝 Abstract

Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting features learned by large language models (LLMs). It aims to recover complex superposed polysemantic features into interpretable monosemantic ones through feature reconstruction via sparsely activated neural networks. Despite the wide applications of SAEs, it remains unclear under what conditions an SAE can fully recover the ground truth monosemantic features from the superposed polysemantic ones. In this paper, through theoretical analysis, we for the first time propose the necessary and sufficient conditions for identifiable SAEs (SAEs that learn unique and ground truth monosemantic features), including 1) extreme sparsity of the ground truth feature, 2) sparse activation of SAEs, and 3) enough hidden dimensions of SAEs. Moreover, when the identifiable conditions are not fully met, we propose a reweighting strategy to improve the identifiability. Specifically, following the theoretically suggested weight selection principle, we prove that the gap between the loss functions of SAE reconstruction and monosemantic feature reconstruction can be narrowed, so that the reweighted SAEs have better reconstruction of the ground truth monosemantic features than the uniformly weighted ones. In experiments, we validate our theoretical findings and show that our weighted SAE significantly improves feature monosemanticity and interpretability.

Problem

Research questions and friction points this paper is trying to address.

Identify conditions for SAEs to recover monosemantic features

Propose reweighting strategy to enhance feature identifiability

Validate theoretical findings with improved feature interpretability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifiable SAEs need extreme sparsity conditions

Reweighting strategy narrows reconstruction loss gap

Weighted SAEs enhance feature interpretability significantly

🔎 Similar Papers

A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders