🤖 AI Summary
Sparse autoencoders (SAEs) have underperformed on tasks that intervene on known concepts, raising questions about their practical utility. This paper argues that their real value lies elsewhere: in discovering *unknown* concepts. By explicitly distinguishing concept discovery from concept intervention, the authors cleanly separate the existing negative results (which concern intervention) from the positive ones (which concern discovery), reconciling previously contradictory empirical findings. The same distinction motivates concrete applications: SAEs as tools for unsupervised discovery of latent concepts in high-dimensional data, with uses in ML interpretability, explainability, fairness, auditing, and safety, as well as in detecting implicit patterns in the social and health sciences.
📝 Abstract
While sparse autoencoders (SAEs) have generated significant excitement, a series of negative results has added to skepticism about their usefulness. Here, we establish a conceptual distinction that reconciles competing narratives surrounding SAEs. We argue that while SAEs may be less effective for acting on known concepts, they are powerful tools for discovering unknown concepts. This distinction cleanly separates existing negative and positive results, and suggests several classes of SAE applications. Specifically, we outline use cases for SAEs in (i) ML interpretability, explainability, fairness, auditing, and safety, and (ii) the social and health sciences.
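For readers unfamiliar with the mechanism the abstract refers to, below is a minimal sketch of the standard SAE setup: an autoencoder trained on a model's internal activations with a sparsity penalty, so that individual latents tend to align with individual concepts. This is an illustrative implementation, not code from the paper; the dimensions (`d_model`, `n_latents`) and the coefficient `l1_coef` are placeholder choices.

```python
# Minimal sparse-autoencoder sketch (illustrative; not from the paper).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        # Encode activations into a wider, overcomplete latent space,
        # then decode back to the original activation space.
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps latent activations non-negative; combined with an
        # L1 penalty (below), most latents are zero on any given input,
        # so each active latent is a candidate "concept".
        z = torch.relu(self.encoder(x))
        x_hat = self.decoder(z)
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coef: float = 1e-3):
    # Reconstruction error plus a sparsity penalty on the latents.
    return ((x - x_hat) ** 2).mean() + l1_coef * z.abs().mean()

# Usage sketch: train on batches of model activations, then inspect
# which inputs most strongly activate each latent to surface
# previously unknown concepts (the discovery use case).
sae = SparseAutoencoder(d_model=768, n_latents=8192)
acts = torch.randn(64, 768)  # placeholder batch of model activations
x_hat, z = sae(acts)
loss = sae_loss(acts, x_hat, z)
loss.backward()
```

The paper's distinction maps onto this setup directly: reading the learned latents (discovery) has held up empirically, whereas editing them to steer the model (intervention on known concepts) is where the negative results have accumulated.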