Probing the Representational Power of Sparse Autoencoders in Vision Models

📅 2025-08-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work systematically evaluates the representational capacity and application potential of Sparse Autoencoders (SAEs) in vision models. Addressing key challenges in interpretability, out-of-distribution generalization, and controllability of generative models, the authors deploy SAEs uniformly across diverse architectures: vision embedding models (Vision Transformers), multimodal large language models, and diffusion models. SAEs are trained on hidden states to build a pipeline for semantic feature discovery and automated attribute extraction. Experiments demonstrate that SAE-extracted features exhibit strong semantic interpretability; enable reliable out-of-distribution detection and recovery of the underlying model's ontological structure; reveal shared cross-modal representations in multimodal LLMs; and support controllable text-guided image generation through manipulation of frozen text encoders. Collectively, these results provide a foundation for interpretable representation learning and controllable generation in vision models.

📝 Abstract
Sparse Autoencoders (SAEs) have emerged as a popular tool for interpreting the hidden states of large language models (LLMs). By learning to reconstruct activations from a sparse bottleneck layer, SAEs discover interpretable features from the high-dimensional internal representations of LLMs. Despite their popularity with language models, SAEs remain understudied in the visual domain. In this work, we provide an extensive evaluation of the representational power of SAEs for vision models using a broad range of image-based tasks. Our experimental results demonstrate that SAE features are semantically meaningful, improve out-of-distribution generalization, and enable controllable generation across three vision model architectures: vision embedding models, multi-modal LLMs, and diffusion models. In vision embedding models, we find that learned SAE features can be used for OOD detection and provide evidence that they recover the ontological structure of the underlying model. For diffusion models, we demonstrate that SAEs enable semantic steering through text encoder manipulation and develop an automated pipeline for discovering human-interpretable attributes. Finally, we conduct exploratory experiments on multi-modal LLMs, finding evidence that SAE features reveal shared representations across vision and language modalities. Our study provides a foundation for SAE evaluation in vision models, highlighting their strong potential for improving interpretability, generalization, and steerability in the visual domain.
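The abstract describes SAEs as learning to reconstruct a model's activations through a sparse bottleneck. The sketch below is only an illustration of that general mechanism, not the paper's implementation; the ReLU-plus-L1 sparsity scheme, layer shapes, and loss coefficient are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Reconstruct d_model-dimensional activations through an overcomplete sparse code."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        z = F.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(z)       # reconstruction of the original activations
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse codes
    return F.mse_loss(x_hat, x) + l1_coeff * z.abs().mean()
```

Each hidden unit of such an SAE is a candidate interpretable feature; the paper's experiments probe how semantically meaningful those features are across vision architectures.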
Problem

Research questions and friction points this paper is trying to address.

Evaluating SAEs' representational power in vision models
Assessing SAE features for interpretability and generalization
Exploring SAEs in multi-modal vision-language representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Autoencoders for vision model interpretability
SAEs enhance out-of-distribution generalization
SAEs enable controllable generation in diffusion models (see the steering sketch after this list)
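To make the controllable-generation point above concrete: one common way to steer with an SAE is to add a feature's decoder direction to the frozen text encoder's hidden states before they condition the diffusion model. This is a hedged sketch rather than the paper's procedure; the function name, feature_idx, and strength are hypothetical.

```python
import torch

def steer_text_hidden_states(hidden_states: torch.Tensor,
                             sae: "SparseAutoencoder",
                             feature_idx: int,
                             strength: float = 5.0) -> torch.Tensor:
    """Push (batch, seq, d_model) text-encoder activations along one SAE feature direction.

    The steered states would then replace the original conditioning passed to the
    diffusion model; feature_idx and strength are illustrative knobs.
    """
    direction = sae.decoder.weight[:, feature_idx]   # (d_model,) decoder column for this feature
    direction = direction / direction.norm()
    return hidden_states + strength * direction      # broadcast over batch and sequence dims
```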
Matthew Lyle Olson
Intel Labs, Santa Clara, CA, USA
Musashi Hinck
Intel Labs, Santa Clara, CA, USA
Neale Ratzlaff
Intel Labs, Santa Clara, CA, USA
Changbai Li
Oregon State University, Corvallis, OR, USA
Phillip Howard
Lead AI Researcher, Thoughtworks
Artificial intelligence, machine learning, responsible AI, synthetic data, interpretability
Vasudev Lal
Oracle
AI, Deep Learning, CV, NLP
Shao-Yen Tseng
Intel Labs
Machine Learning, Natural Language Processing, Speech and Audio Processing