Transformer Interpretability from Perspective of Attention and Gradient

📅 2026-05-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the limited interpretability of Vision Transformer (ViT) models on visual tasks, where it is often unclear how the model's attention differs from human perception. The authors propose a method that combines attention with gradient guidance to interpret ViT feature responses in finer detail and to visualize the model's decision rationale more comprehensively. Exploiting the difference between how ViTs and humans perceive images, they show that a ViT's predicted class can be changed by input perturbations that are almost imperceptible to the human eye. The authors note that this class rewriting may pose security risks in certain scenarios.
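To make the idea of combining attention with gradients concrete, here is a minimal sketch of a generic gradient-weighted attention rollout. It is not the authors' method, whose details are not given on this page. It only illustrates the common pattern of weighting each attention map by its gradient, keeping the positive part, averaging over heads, and accumulating relevance across layers. The function names and all numeric values are hypothetical toy choices.

```python
# Generic gradient-weighted attention rollout (an illustration only,
# NOT the paper's exact method). All numbers are made-up toy values.

def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def head_average(attn, grad):
    """Average max(grad * attn, 0) over the heads of one layer."""
    heads = len(attn)
    n = len(attn[0])
    avg = [[0.0] * n for _ in range(n)]
    for h in range(heads):
        for i in range(n):
            for j in range(n):
                avg[i][j] += max(grad[h][i][j] * attn[h][i][j], 0.0) / heads
    return avg

def relevance_rollout(attns, grads):
    """Accumulate relevance over layers: R <- R + A_bar @ R, with R = I at the start."""
    n = len(attns[0][0])
    rel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for attn, grad in zip(attns, grads):
        a_bar = head_average(attn, grad)
        update = matmul(a_bar, rel)
        rel = [[rel[i][j] + update[i][j] for j in range(n)] for i in range(n)]
    return rel

# Toy input: one layer, one head, three tokens (token 0 plays the role of [CLS]).
attns = [[[[0.2, 0.5, 0.3],
           [0.1, 0.8, 0.1],
           [0.3, 0.3, 0.4]]]]
# Hypothetical gradients of the class score with respect to each attention entry.
grads = [[[[0.0, 1.0, -1.0],
           [0.0, 0.5, 0.0],
           [0.0, 0.0, 0.5]]]]

rel = relevance_rollout(attns, grads)
cls_to_patches = rel[0][1:]  # relevance of each patch token for the [CLS] row
print(cls_to_patches)
```

In this toy case the second patch receives zero relevance even though its raw attention weight is 0.3, because its gradient is negative. That is the kind of difference between raw attention and gradient-guided attention that such methods aim to expose.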
📝 Abstract
Although researchers focus mainly on the performance of Transformer models, the interpretation of Transformers should not be ignored. Gradients are widely used in Transformer interpretation. From the perspective of attention and gradient, we conduct an in-depth study of Transformer interpretation and propose a method that achieves it by guiding the gradient direction, or more precisely, the attention direction. The method enables more comprehensive interpretation of feature regions, offers detailed interpretation, and helps to better understand the Transformer mechanism. Leveraging the difference in how Vision Transformers (ViT) and humans perceive images, we alter the class of an image in a way that is almost imperceptible to the human eye. This class rewriting phenomenon may pose security risks in certain scenarios.
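The class rewriting described above can be illustrated with a small, bounded, gradient-guided perturbation. The sketch below uses a single sign-gradient step (in the style of the fast gradient sign method) on a toy linear classifier that stands in for a ViT. This is an assumption made for illustration, not the authors' procedure, and the weights, input, and `eps` value are hypothetical.

```python
# Toy illustration of changing a predicted class with a small bounded perturbation.
# A linear classifier stands in for a ViT; this is NOT the paper's procedure.

def logits(weights, x):
    """Compute one score per class for input vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def predict(weights, x):
    """Return the index of the highest-scoring class."""
    scores = logits(weights, x)
    return max(range(len(scores)), key=lambda c: scores[c])

def sign(v):
    """Return -1.0, 0.0, or 1.0 according to the sign of v."""
    return (v > 0) - (v < 0)

def sign_gradient_step(weights, x, source, target, eps):
    """Move x by at most eps per component toward the target class.

    For a linear model, the gradient of (logit[target] - logit[source])
    with respect to x is simply weights[target] - weights[source].
    """
    grad = [wt - ws for wt, ws in zip(weights[target], weights[source])]
    stepped = [v + eps * sign(g) for v, g in zip(x, grad)]
    # Keep values in the valid pixel range [0, 1].
    return [min(1.0, max(0.0, v)) for v in stepped]

# Hypothetical two-class weights and a four-"pixel" input.
weights = [[1.0, -1.0, 0.5, -0.5],
           [-1.0, 1.0, -0.5, 0.5]]
x = [0.52, 0.50, 0.50, 0.50]
eps = 0.01  # maximum change allowed per pixel

before = predict(weights, x)
x_adv = sign_gradient_step(weights, x, source=before, target=1 - before, eps=eps)
after = predict(weights, x_adv)
max_change = max(abs(a - b) for a, b in zip(x_adv, x))
print(before, after, max_change)
```

Here the input sits close to the decision boundary, so a change of at most 0.01 per component is enough to switch the predicted class. A real ViT would need gradients obtained by backpropagation instead of the closed-form expression used for this linear model.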
Problem

Research questions and friction points this paper is trying to address.

Transformer Interpretability
Attention Mechanism
Gradient-based Explanation
Vision Transformer
Adversarial Perturbation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer Interpretability
Attention Mechanism
Gradient-based Explanation
Class Rewriting
Vision Transformer