🤖 AI Summary
Current research on vision model interpretability faces two key bottlenecks: (1) discovered features cannot be causally validated, and (2) controllable editing remains infeasible without fine-tuning. This paper introduces the first unified framework based on sparse autoencoders (SAEs) that jointly enables the discovery of human-interpretable visual features, disentangled representation learning, and causal intervention validation. Methodologically, we systematically integrate SAEs into the internal representations of vision models for both interpretability modeling and causal testing, performing targeted, activation-level interventions to run controlled hypothesis-testing experiments. Evaluated across multiple state-of-the-art vision models, our framework consistently identifies and manipulates semantically coherent, high-level visual features (e.g., texture, object parts, and attributes), and the interventions exhibit reproducibility and attribution fidelity. Code, pretrained models, and an interactive demo are publicly released.
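To make the "activation-level intervention" idea concrete, here is a minimal sketch of the general SAE editing pattern: encode an activation into sparse features, rescale one feature, and apply the corresponding decoder direction back to the activation. All names, dimensions, and the ReLU encoder design are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: d = model activation width, k = SAE dictionary size.
d, k = 8, 32
W_enc = rng.standard_normal((d, k))   # encoder weights (assumed)
W_dec = rng.standard_normal((k, d))   # decoder weights (assumed)
b_enc = np.zeros(k)

def sae_encode(x):
    # Sparse, non-negative feature activations (a common SAE design choice).
    return np.maximum(x @ W_enc + b_enc, 0.0)

def intervene(x, feature_idx, scale=0.0):
    """Rescale one SAE feature's contribution inside activation x.

    scale=0.0 ablates the feature; scale>1.0 amplifies it.
    The edit is applied as a delta along the feature's decoder direction,
    so the rest of x (including SAE reconstruction error) is untouched.
    """
    f = sae_encode(x)
    delta = (scale - 1.0) * f[feature_idx] * W_dec[feature_idx]
    return x + delta

x = rng.standard_normal(d)
x_ablated = intervene(x, feature_idx=3, scale=0.0)
```

In a real pipeline, `x` would be a hidden activation captured via a forward hook, and the edited activation would be patched back into the model before continuing the forward pass; the no-retraining property comes from editing activations rather than weights.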
📝 Abstract
To truly understand vision models, we must not only interpret their learned features but also validate these interpretations through controlled experiments. Current approaches either provide interpretable features without the ability to test their causal influence, or enable model editing without interpretable controls. We present a unified framework using sparse autoencoders (SAEs) that bridges this gap, allowing us to discover human-interpretable visual features and precisely manipulate them to test hypotheses about model behavior. By applying our method to state-of-the-art vision models, we reveal key differences in the semantic abstractions learned by models with different pre-training objectives. We then demonstrate the practical utility of our framework through controlled interventions across multiple vision tasks. We show that SAEs can reliably identify and manipulate interpretable visual features without model re-training, providing a powerful tool for understanding and controlling vision model behavior. We provide code, demos, and models on our project website: https://osu-nlp-group.github.io/SAE-V.