Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Sparse autoencoders (SAEs) in large vision models suffer from non-reproducible dictionaries and unstable concept interpretability. To address this, the paper proposes a geometrically anchored, stable representation-learning framework: the first integration of archetypal analysis into SAEs, yielding A-SAE and its relaxed variant RA-SAE. Convex geometric constraints anchor dictionary atoms strictly within the data's convex hull, ensuring structural interpretability and training robustness. The paper further establishes a dual benchmark for interpretable dictionaries ("plausibility" and "identifiability") and designs a synthetic concept-mixing protocol to evaluate disentanglement. Experiments demonstrate that RA-SAE significantly improves dictionary stability and semantic consistency on benchmarks including ImageNet, extracting novel, human-interpretable visual concepts while matching state-of-the-art reconstruction performance.

📝 Abstract
Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: existing SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), wherein dictionary atoms are constrained to the convex hull of the data. This geometric anchoring significantly enhances the stability of inferred dictionaries, and their mildly relaxed variants, RA-SAEs, further match state-of-the-art reconstruction abilities. To rigorously assess the quality of dictionaries learned by SAEs, we introduce two new benchmarks that test (i) plausibility, whether dictionaries recover "true" classification directions, and (ii) identifiability, whether dictionaries disentangle synthetic concept mixtures. Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.
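The convex-hull anchoring described in the abstract can be sketched compactly: if each dictionary atom is parameterized as a softmax-weighted combination of data points, it is automatically a convex combination and cannot leave the data's convex hull. The following NumPy sketch is illustrative only (assuming this parameterization; `archetypal_atoms` and the variable names are hypothetical, not the authors' code):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: non-negative weights that sum to 1.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def archetypal_atoms(logits, candidates):
    """Illustrative sketch of the A-SAE constraint: each dictionary atom
    is a convex combination of candidate data points, so every atom lies
    inside the convex hull of `candidates`.

    logits:     (n_atoms, n_points) learnable parameters
    candidates: (n_points, d) data points spanning the hull
    returns:    (n_atoms, d) dictionary atoms
    """
    weights = softmax(logits)      # rows are non-negative and sum to 1
    return weights @ candidates    # convex combinations of data points
```

Because every weight row is a convex combination, each atom coordinate is bounded by the per-coordinate min and max of the candidate points; gradient updates act on the unconstrained `logits`, so the constraint holds throughout training. A relaxed variant (as in RA-SAE) would add a small unconstrained residual to each atom.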
Problem

Research questions and friction points this paper is trying to address.

Enhance the stability of dictionary learning in SAEs
Improve the interpretability of large vision models
Develop rigorous benchmarks for assessing dictionary quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Archetypal SAEs (A-SAE) constrain dictionary atoms to the data's convex hull, enhancing stability
Relaxed variants (RA-SAE) match state-of-the-art reconstruction performance
New plausibility and identifiability benchmarks assess dictionary quality rigorously