🤖 AI Summary
Understanding the internal representational mechanisms of Vision Transformers (ViTs) remains challenging due to their opaque, distributed representations.
Method: We introduce a residual-stream feature extraction framework based on sparse autoencoders (SAEs), applied across all ViT layers to identify approximately 6.6K interpretable features. We further construct a residual replacement model that reconstructs the full forward pass as a concise, faithful, and interpretable computational circuit.
Contribution/Results: Our approach systematically uncovers hierarchical feature evolution—from low-level textures and edges to high-level semantics—and reveals specialized feature types that encode curves and spatial positions, enabling precise, human-understandable reasoning about ViT decisions. It further facilitates the detection and mitigation of spurious correlations, demonstrating practical utility in bias correction. By scaling SAE-based interpretability to the entire ViT architecture, our work establishes an extensible methodological paradigm for mechanistic interpretability in large foundation models.
📝 Abstract
How do vision transformers (ViTs) represent and process the world? This paper addresses this long-standing question through the first systematic analysis of 6.6K features across all layers, extracted via sparse autoencoders, and by introducing the residual replacement model, which replaces ViT computations with interpretable features in the residual stream. Our analysis reveals not only a feature evolution from low-level patterns to high-level semantics, but also how ViTs encode curves and spatial positions through specialized feature types. The residual replacement model scalably produces a faithful yet parsimonious circuit for human-scale interpretability by significantly simplifying the original computations. As a result, this framework enables intuitive understanding of ViT mechanisms. Finally, we demonstrate the utility of our framework in debiasing spurious correlations.
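The core mechanism described above can be sketched in a few lines: a sparse autoencoder encodes residual-stream activations into an overcomplete set of sparse features and linearly decodes them back. The sketch below is a minimal illustration under assumed toy dimensions; the variable names, sizes, and ReLU encoder are illustrative assumptions, not the paper's actual architecture or trained weights.

```python
import numpy as np

# Toy sketch of a sparse autoencoder (SAE) over ViT residual-stream
# activations. All dimensions and weights here are hypothetical.
rng = np.random.default_rng(0)

d_model = 8    # residual-stream width (illustrative)
d_feats = 32   # overcomplete SAE dictionary size (illustrative)

W_enc = rng.normal(0.0, 0.1, (d_model, d_feats))
b_enc = np.zeros(d_feats)
W_dec = rng.normal(0.0, 0.1, (d_feats, d_model))
b_dec = np.zeros(d_model)

def sae_forward(x):
    """Encode residual activations into sparse features, then reconstruct."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU yields sparse, nonnegative features
    x_hat = f @ W_dec + b_dec               # linear decoder reconstructs the stream
    return f, x_hat

x = rng.normal(size=(4, d_model))           # activations at 4 token positions
feats, recon = sae_forward(x)
```

In practice such an SAE is trained to minimize reconstruction error plus an L1 sparsity penalty on the feature activations, so that each image patch activates only a handful of dictionary features; the residual replacement model then substitutes these interpretable features for the raw residual-stream computation.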