Causal Attribution via Activation Patching

πŸ“… 2026-03-13
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limitations of existing attribution methods for Vision Transformers (ViTs), which rely on input perturbations and often fail to accurately identify image regions with genuine causal influence on predictions. To overcome this, the authors propose a causal attribution mechanism based on interventions in intermediate-layer activations. Specifically, patch-level activations from a source image are embedded into a neutral target context, and the resulting change in the target-class score is used as a direct measure of each patch's causal effect within the model's internal representation. By intervening over an intermediate range of layers in activation space rather than perturbing the input, this approach avoids the spatial blurring introduced by late-layer global token mixing in ViTs, thereby improving both the fidelity and the localization accuracy of attribution maps. Extensive experiments across multiple ViT architectures and standard evaluation metrics demonstrate consistent superiority over current attribution techniques.

πŸ“ Abstract
Attribution methods for Vision Transformers (ViTs) aim to identify image regions that influence model predictions, but producing faithful and well-localized attributions remains challenging. Existing gradient-based and perturbation-based techniques often fail to isolate the causal contribution of internal representations associated with individual image patches. The key challenge is that class-relevant evidence is formed through interactions between patch tokens across layers, and input-level perturbations can be poor proxies for patch importance, since they may fail to reconstruct the internal evidence actually used by the model. We propose Causal Attribution via Activation Patching (CAAP), which estimates the contribution of individual image patches to the ViT's prediction by directly intervening on internal activations rather than using learned masks or synthetic perturbation patterns. For each patch, CAAP inserts the corresponding source-image activations into a neutral target context over an intermediate range of layers and uses the resulting target-class score as the attribution signal. The resulting attribution map reflects the causal effect of patch-associated internal representations on the model's prediction. The causal intervention serves as a principled measure of patch influence by capturing class-relevant evidence after initial representation formation, while avoiding late-layer global mixing that can reduce spatial specificity. Across multiple ViT backbones and standard metrics, CAAP significantly outperforms existing methods and produces more faithful and localized attributions.
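The intervention the abstract describes can be sketched in a few lines. The following is a minimal illustrative sketch, not the authors' implementation: the ViT is replaced by a toy NumPy model (a few layers of global token mixing plus a linear head), and all names (`forward`, `caap_map`, `MIX`, `W_HEAD`) are assumptions introduced here for illustration. The structure of the intervention follows the abstract: cache per-layer source activations, insert one patch's activations into a neutral target context over an intermediate layer range, and read off the patched target-class score as that patch's attribution.

```python
import numpy as np

# Toy stand-in for a ViT: layers of global token mixing plus a per-token
# linear map, followed by a mean-pooled classification head. Purely
# illustrative; NOT the paper's model or code.
N_PATCHES, D, N_LAYERS, N_CLASSES = 4, 8, 3, 5
rng = np.random.default_rng(0)
MIX = rng.normal(size=(N_PATCHES, N_PATCHES)) / N_PATCHES
WS = [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(N_LAYERS)]
W_HEAD = rng.normal(size=(D, N_CLASSES)) / np.sqrt(D)

def forward(tokens, patch=None):
    """Run the toy model, returning (class logits, per-layer activations).

    patch = (token_index, {layer: activation}) overwrites that token's
    activation after each listed layer, i.e. an activation patch.
    """
    h, acts = tokens, []
    for layer, W in enumerate(WS):
        h = np.tanh(MIX @ h @ W)            # token mixing + transform
        if patch is not None and layer in patch[1]:
            h = h.copy()
            h[patch[0]] = patch[1][layer]   # insert donor activation
        acts.append(h)
    return h.mean(axis=0) @ W_HEAD, acts    # pooled class scores

def caap_map(source, target, target_class, layers=(1, 2)):
    """Per-patch attribution in the spirit of CAAP: copy each source
    patch's activations into a neutral target context over an
    intermediate layer range; the patched target-class score is the
    attribution signal."""
    _, src_acts = forward(source)
    return np.array([
        forward(target,
                patch=(p, {l: src_acts[l][p] for l in layers}))[0][target_class]
        for p in range(N_PATCHES)
    ])
```

With an all-zero (neutral) target context, the unpatched target-class score is a fixed baseline, so each patched score isolates the causal effect of one patch's internal representation; patching over intermediate layers rather than the final one is what preserves spatial specificity in the abstract's account.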
Problem

Research questions and friction points this paper is trying to address.

causal attribution
Vision Transformers
activation patching
attribution localization
internal representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal Attribution
Activation Patching
Vision Transformers
Interpretability
Internal Interventions
πŸ”Ž Similar Papers
No similar papers found.
Authors
Amirmohammad Izadi
Sharif University of Technology
Mohammadali Banayeeanzade
Sharif University of Technology
Alireza Mirrokni
Sharif University of Technology
Hosein Hasani
Sharif University of Technology
Machine Learning
Mobin Bagherian
Sharif University of Technology
Faridoun Mehri
Sharif University of Technology
Mahdieh Soleymani Baghshah
Associate Professor, Computer Engineering Department, Sharif University of Technology
Deep Learning, Machine Learning