🤖 AI Summary
Existing ViT interpretability methods often rely solely on single-layer attention weights, making them vulnerable to noise, while attribution techniques adapted from CNNs degrade significantly when transferred to ViTs. To address this, we propose StatAtt, a multi-layer attention fusion framework based on statistical filtering that jointly models cross-layer attention-map saliency and calibrates MLP feature responses, suppressing noise and enhancing class sensitivity. StatAtt requires no additional training and is computationally efficient. On benchmarks including ImageNet, it produces sharper, more semantically concentrated attribution maps, outperforming or matching state-of-the-art methods both on quantitative metrics (e.g., AUC, faithfulness) and on alignment with human eye-tracking data. Our results empirically validate that statistically refined attention signals provide high explanatory value for ViT interpretation.
📝 Abstract
Explainable AI (XAI) has become increasingly important with the rise of large transformer models, yet many explanation methods designed for CNNs transfer poorly to Vision Transformers (ViTs). Existing ViT explanations often rely on attention weights, which tend to yield noisy maps because they capture token-to-token interactions within each layer. While attribution methods that incorporate MLP blocks have been proposed, we argue that attention remains a valuable and interpretable signal when properly filtered. We propose a method that combines attention maps with a statistical filtering technique, originally proposed for CNNs, to remove noisy or uninformative patterns and produce more faithful explanations. We further extend our approach with a class-specific variant that yields discriminative explanations. Evaluation against popular state-of-the-art methods demonstrates that our approach produces sharper and more interpretable maps. In addition to perturbation-based faithfulness metrics, we incorporate human gaze data to assess alignment with human perception, arguing that human interpretability remains essential for XAI. Across multiple datasets, our approach consistently outperforms or matches state-of-the-art methods while remaining efficient and perceptually plausible.
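To make the core idea concrete, the sketch below shows one way to statistically filter per-layer CLS-to-patch attention before fusing it into a single saliency map. The threshold rule (mean + k·std cutoff) and the normalized averaging over layers are illustrative assumptions for this sketch, not the paper's actual StatAtt algorithm:

```python
import numpy as np

def statistically_filtered_fusion(attn_maps, k=1.0):
    """Fuse per-layer CLS-to-patch attention vectors into one saliency map.

    Each element of `attn_maps` is a 1-D array holding the CLS token's
    attention over the image patches for one layer. Values below a per-layer
    statistical cutoff (mean + k * std) are treated as noise and zeroed out;
    the surviving signal is normalized per layer and averaged across layers.

    NOTE: illustrative sketch only; the cutoff and fusion scheme are
    assumptions, not the method described in the abstract.
    """
    fused = np.zeros_like(attn_maps[0], dtype=float)
    for a in attn_maps:
        thresh = a.mean() + k * a.std()           # per-layer statistical cutoff
        filtered = np.where(a >= thresh, a, 0.0)  # suppress low, noisy values
        if filtered.max() > 0:
            filtered = filtered / filtered.max()  # normalize surviving signal
        fused += filtered
    return fused / len(attn_maps)
```

In a real ViT pipeline these vectors would come from the model's attention tensors (e.g., averaged over heads); here the function only assumes pre-extracted 1-D attention arrays of equal length.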