Interpreting CLIP with Hierarchical Sparse Autoencoders

📅 2025-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing sparse autoencoders (SAEs) struggle to achieve high reconstruction fidelity and strong sparsity at the same time, which limits their usefulness for interpreting multimodal models such as CLIP and SigLIP. The paper proposes the Matryoshka SAE (MSAE), a hierarchical architecture that optimizes reconstruction quality and sparsity jointly across multiple granularities, sidestepping the trade-off of conventional SAEs. The method combines hierarchical sparse coding, multi-granularity feature disentanglement, and semantic concept extraction, enabling concept-level similarity retrieval and bias analysis. Evaluated on CLIP, MSAE achieves 0.99 cosine similarity, under 0.1 fraction of variance unexplained, and roughly 80% sparsity, while disentangling over 120 interpretable vision-language concepts, with downstream analyses demonstrated on CelebA.

📝 Abstract
Sparse autoencoders (SAEs) are useful for detecting and steering interpretable features in neural networks, with particular potential for understanding complex multimodal representations. Given their ability to uncover interpretable features, SAEs are particularly valuable for analyzing large-scale vision-language models (e.g., CLIP and SigLIP), which are fundamental building blocks in modern systems yet remain challenging to interpret and control. However, current SAE methods struggle to optimize both reconstruction quality and sparsity simultaneously, as they rely on either activation suppression or rigid sparsity constraints. To this end, we introduce Matryoshka SAE (MSAE), a new architecture that learns hierarchical representations at multiple granularities simultaneously, enabling a direct optimization of both metrics without compromise. MSAE establishes a new state-of-the-art Pareto frontier between reconstruction quality and sparsity for CLIP, achieving 0.99 cosine similarity and less than 0.1 fraction of variance unexplained while maintaining ~80% sparsity. Finally, we demonstrate the utility of MSAE as a tool for interpreting and controlling CLIP by extracting over 120 semantic concepts from its representation to perform concept-based similarity search and bias analysis in downstream tasks like CelebA.
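The core idea above, reconstructing from nested prefixes of the latent code so that early latents capture coarse structure and later ones add detail, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the ReLU encoder, random weights, and the particular prefix sizes are all assumptions for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def msae_losses(x, W_enc, W_dec, prefixes):
    """Sketch of Matryoshka-style training: reconstruct x from nested
    prefixes of the sparse code and return one MSE loss per granularity.

    x:        (n, d_model) input activations
    W_enc:    (d_model, d_latent) encoder weights (assumed linear + ReLU)
    W_dec:    (d_latent, d_model) decoder weights
    prefixes: increasing latent-prefix sizes, e.g. (256, 1024, 4096)
    """
    z = relu(x @ W_enc)                      # nonnegative sparse code
    losses = []
    for m in prefixes:
        z_m = np.zeros_like(z)
        z_m[:, :m] = z[:, :m]                # keep only the first m latents
        x_hat = z_m @ W_dec                  # coarse-to-fine reconstruction
        losses.append(float(np.mean((x_hat - x) ** 2)))
    return z, losses
```

In training, the per-prefix losses would be summed (optionally with a sparsity penalty) and backpropagated, so every prefix of the code is itself a usable sparse representation.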
Problem

Research questions and friction points this paper is trying to address.

How can the internal representations of CLIP be made interpretable?
How can SAEs optimize reconstruction quality and sparsity simultaneously, without compromise?
How can semantic concepts be extracted to support similarity search and bias analysis?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical sparse autoencoders for multimodal analysis
Matryoshka SAE optimizes reconstruction and sparsity simultaneously
Extracts semantic concepts for similarity search and bias analysis
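Once latents are tied to named concepts, concept-based similarity search reduces to ranking samples by a single SAE activation. A minimal sketch under that assumption (the function name and the toy activation matrix are illustrative, not from the paper):

```python
import numpy as np

def concept_search(codes, concept_idx, top_k=5):
    """Rank samples by how strongly one SAE latent ('concept') fires.

    codes:       (n_samples, d_latent) nonnegative SAE activations
    concept_idx: index of the latent identified with a semantic concept
    Returns the indices of the top_k most activated samples.
    """
    scores = codes[:, concept_idx]
    return np.argsort(-scores)[:top_k]      # descending by activation
```

The same per-concept scores can feed a bias analysis, e.g. comparing a concept's mean activation across demographic groups in a dataset like CelebA.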