Steering CLIP's vision transformer with sparse autoencoders

📅 2025-04-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited interpretability and controllability of CLIP's vision transformer. The authors train sparse autoencoders (SAEs) across the vision transformer's layers and uncover key differences between vision and language processing, including sparsity patterns that vary by layer and token type. They then provide the first systematic analysis of the steerability of CLIP's visual pathway, introducing metrics that quantify how precisely SAE features can be steered to affect the model's output. They find that 10–15% of neurons and features are steerable, with SAEs exposing thousands more steerable features than the base model offers. Finally, targeted suppression of SAE features improves performance on three vision disentanglement tasks (CelebA attribute editing, Waterbirds bias mitigation, and typographic attack defense), with disentanglement strongest in middle layers and state-of-the-art results on defense against typographic attacks.

📝 Abstract
While vision models are highly capable, their internal mechanisms remain poorly understood -- a challenge which sparse autoencoders (SAEs) have helped address in language, but which remains underexplored in vision. We address this gap by training SAEs on CLIP's vision transformer and uncover key differences between vision and language processing, including distinct sparsity patterns for SAEs trained across layers and token types. We then provide the first systematic analysis on the steerability of CLIP's vision transformer by introducing metrics to quantify how precisely SAE features can be steered to affect the model's output. We find that 10-15% of neurons and features are steerable, with SAEs providing thousands more steerable features than the base model. Through targeted suppression of SAE features, we then demonstrate improved performance on three vision disentanglement tasks (CelebA, Waterbirds, and typographic attacks), finding optimal disentanglement in middle model layers, and achieving state-of-the-art performance on defense against typographic attacks.
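To make the mechanism in the abstract concrete, here is a minimal numpy sketch (not the paper's code) of the core operation: encode an activation vector into sparse features with a ReLU encoder, zero out one feature, and decode back. All weights, dimensions, and names below are toy stand-ins, not the trained SAEs from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: d_model stands in for the ViT residual-stream width,
# d_sae for the (overcomplete) SAE dictionary size.
d_model, d_sae = 8, 32

# Randomly initialised weights: stand-ins for a trained SAE.
W_enc = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_sae)
b_dec = np.zeros(d_model)

def sae_encode(x):
    """ReLU encoder: map an activation vector to sparse feature activations."""
    return np.maximum(0.0, W_enc @ x + b_enc)

def sae_decode(f):
    """Linear decoder: reconstruct the activation from feature activations."""
    return W_dec @ f + b_dec

def suppress_and_decode(x, feature_idx):
    """Targeted suppression: zero out one SAE feature, then decode."""
    f = sae_encode(x)
    f[feature_idx] = 0.0
    return sae_decode(f)

x = rng.normal(size=d_model)            # simulated CLIP activation vector
f = sae_encode(x)
k = int(np.argmax(f))                   # pick the most active feature
x_hat = sae_decode(f)                   # ordinary reconstruction
x_steered = suppress_and_decode(x, k)   # reconstruction with feature k ablated

# Because the decoder is linear, suppressing feature k removes exactly
# f[k] times its decoder direction from the reconstruction.
assert np.allclose(x_hat - x_steered, f[k] * W_dec[:, k])
```

The last assertion shows why SAE features are a convenient steering handle: each feature contributes a single linear direction to the reconstruction, so ablating or rescaling one feature edits the activation along exactly that direction.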
Problem

Research questions and friction points this paper is trying to address.

Understanding internal mechanisms of vision models using sparse autoencoders
Analyzing steerability of CLIP's vision transformer with SAE features
Improving vision disentanglement tasks via targeted SAE feature suppression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse autoencoders trained on CLIP's vision transformer reveal layer- and token-type-dependent sparsity patterns
Metrics quantifying how precisely SAE features can be steered to affect the model's output
Targeted suppression of SAE features improves performance on vision disentanglement tasks
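The paper's steerability metrics are not reproduced here; as one hypothetical illustration of the idea, the sketch below steers an activation along a direction and scores what fraction of the resulting output change lands on the intended class of a CLIP-style zero-shot head. The scoring formula, dimensions, and the use of a class embedding as the steering direction (in the paper it would be an SAE feature's decoder direction) are all assumptions for illustration.

```python
import numpy as np

# Toy CLIP-style zero-shot head: one unit-norm class embedding per class
# (hypothetical stand-ins for CLIP text embeddings).
n_classes = 4
class_embeds = np.eye(n_classes)

def zero_shot_probs(x):
    """Softmax over cosine similarities, as in CLIP zero-shot classification."""
    sims = class_embeds @ x / (np.linalg.norm(class_embeds, axis=1) * np.linalg.norm(x))
    e = np.exp(sims - sims.max())
    return e / e.sum()

def steering_precision(x, direction, target, alpha=3.0):
    """Hypothetical steerability score: fraction of the total output change
    that lands on the intended target class when steering along `direction`."""
    p0 = zero_shot_probs(x)
    p1 = zero_shot_probs(x + alpha * direction)
    target_shift = p1[target] - p0[target]
    total_shift = np.abs(p1 - p0).sum()
    return target_shift / total_shift if total_shift > 0 else 0.0

x = np.full(n_classes, 0.5)      # neutral activation: uniform class probabilities
direction = class_embeds[2]      # steer toward class 2's direction
score = steering_precision(x, direction, target=2)

# Steering toward class 2 raises its probability above the uniform baseline.
assert zero_shot_probs(x + 3.0 * direction)[2] > zero_shot_probs(x)[2]
assert 0.0 < score <= 1.0
```

A score near 1 would mean the steering direction moves probability mass almost exclusively onto the target class; a score near 0 would mean the intervention mostly causes collateral changes elsewhere, which is the kind of precision/side-effect trade-off a steerability metric needs to capture.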