🤖 AI Summary
Existing ViT interpretability methods often rely solely on single-layer attention weights, making them vulnerable to noise, while attribution techniques adapted from CNNs degrade significantly when transferred to ViTs. To address this, we propose StatAtt, a multi-layer attention fusion framework based on statistical filtering that jointly models cross-layer attention-map saliency and calibrates MLP feature responses, suppressing noise and enhancing class sensitivity. StatAtt requires no additional training and is computationally efficient. On benchmarks including ImageNet, it produces sharper, more semantically concentrated attribution maps, outperforming or matching state-of-the-art methods both on quantitative metrics (e.g., AUC, faithfulness) and on alignment with human eye-tracking data. Our results empirically validate that statistically refined attention signals provide high explanatory value for ViT interpretation.
📝 Abstract
Explainable AI (XAI) has become increasingly important with the rise of large transformer models, yet many explanation methods designed for CNNs transfer poorly to Vision Transformers (ViTs). Existing ViT explanations often rely on attention weights, which tend to yield noisy maps because they capture token-to-token interactions within each layer. While attribution methods that incorporate MLP blocks have been proposed, we argue that attention remains a valuable and interpretable signal when properly filtered. We propose a method that combines attention maps with a statistical filtering technique, originally proposed for CNNs, to remove noisy or uninformative patterns and produce more faithful explanations. We further extend our approach with a class-specific variant that yields discriminative explanations. Evaluation against popular state-of-the-art methods demonstrates that our approach produces sharper and more interpretable maps. In addition to perturbation-based faithfulness metrics, we incorporate human gaze data to assess alignment with human perception, arguing that human interpretability remains essential for XAI. Across multiple datasets, our approach consistently outperforms or matches state-of-the-art methods while remaining efficient and perceptually plausible.
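To make the core idea concrete, the sketch below shows one way to statistically filter per-layer CLS-to-patch attention before fusing it into a single saliency map. The threshold rule (mean + k·std cutoff) and the normalized averaging over layers are illustrative assumptions for this sketch, not the paper's actual StatAtt algorithm:

```python
import numpy as np

def statistically_filtered_fusion(attn_maps, k=1.0):
    """Fuse per-layer CLS-to-patch attention vectors into one saliency map.

    Each element of `attn_maps` is a 1-D array holding the CLS token's
    attention over the image patches for one layer. Values below a per-layer
    statistical cutoff (mean + k * std) are treated as noise and zeroed out;
    the surviving signal is normalized per layer and averaged across layers.

    NOTE: illustrative sketch only; the cutoff and fusion scheme are
    assumptions, not the method described in the abstract.
    """
    fused = np.zeros_like(attn_maps[0], dtype=float)
    for a in attn_maps:
        thresh = a.mean() + k * a.std()           # per-layer statistical cutoff
        filtered = np.where(a >= thresh, a, 0.0)  # suppress low, noisy values
        if filtered.max() > 0:
            filtered = filtered / filtered.max()  # normalize surviving signal
        fused += filtered
    return fused / len(attn_maps)
```

In a real ViT pipeline these vectors would come from the model's attention tensors (e.g., averaged over heads); here the function only assumes pre-extracted 1-D attention arrays of equal length.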