Dynamic Accumulated Attention Map for Interpreting Evolution of Decision-Making in Vision Transformer

📅 2025-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual explanation methods fail to characterize the formation process of decision-relevant attention regions in Vision Transformers (ViTs), particularly lacking the capability to trace top-down attention evolution across layers. To address this, we propose the Dynamic Accumulated Attention Map (DAAM), the first method enabling layer-wise visualization of attention flow through intermediate ViT layers. The approach introduces three key innovations: (i) spatial feature reconstruction via decomposition of the [class] token; (ii) channel-wise importance coefficient computation; and (iii) a dynamic accumulation mechanism for block-level attention maps. Additionally, dimension-wise importance weights unify interpretability analysis for both supervised and self-supervised ViT variants. Extensive evaluations on multiple benchmarks demonstrate that DAAM significantly outperforms state-of-the-art methods, achieving up to 12.3% improvement in quantitative metrics. Qualitative results clearly reveal the hierarchical evolution of attention, from global contextual aggregation to fine-grained local discrimination, across ViT layers.

📝 Abstract
Various Vision Transformer (ViT) models have been widely used for image recognition tasks. However, existing visual explanation methods cannot display the attention flow hidden inside the inner structure of ViT models, i.e., how the final attention regions are formed inside a ViT for its decision-making. In this paper, a novel visual explanation approach, Dynamic Accumulated Attention Map (DAAM), is proposed to provide a tool that can visualize, for the first time, the attention flow from the top to the bottom through a ViT network. To this end, a novel decomposition module is proposed to construct and store the spatial feature information by unlocking the [class] token generated by the self-attention module of each ViT block. The module also obtains the channel importance coefficients by decomposing the classification score for supervised ViT models. Because self-supervised ViT models lack a classification score, we propose dimension-wise importance weights to compute the channel importance coefficients. The spatial features are linearly combined with the corresponding channel importance coefficients, forming the attention map for each block. The dynamic attention flow is revealed by block-wise accumulation of these attention maps. The contribution of this work is visualizing the evolution dynamics of the decision-making attention at any intermediate block inside a ViT model through the proposed decomposition module and dimension-wise importance weights. Quantitative and qualitative analyses consistently validate the effectiveness and superior capacity of the proposed DAAM for interpreting not only ViT models with fully connected layers as the classifier but also self-supervised ViT models. The code is available at https://github.com/ly9802/DynamicAccumulatedAttentionMap.
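The abstract's per-block construction, spatial features linearly combined with channel importance coefficients and then accumulated block-wise, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's exact procedure: the function names, the ReLU clipping, and the max-normalization are assumptions for clarity.

```python
import numpy as np

def block_attention_map(feats, weights):
    """Linearly combine per-channel spatial features with importance weights.

    feats:   (C, H, W) spatial features, assumed reconstructed from the
             [class] token of one ViT block
    weights: (C,) channel importance coefficients for that block
    """
    amap = np.tensordot(weights, feats, axes=1)   # weighted sum -> (H, W)
    amap = np.maximum(amap, 0.0)                  # keep positive evidence only (assumption)
    return amap / (amap.max() + 1e-8)             # normalize to [0, 1]

def accumulate_daam(block_feats, block_weights):
    """Block-wise accumulation of attention maps, DAAM-style.

    Returns one accumulated (and renormalized) map per block, revealing
    how the attention region evolves as blocks are added.
    """
    acc, flow = None, []
    for feats, w in zip(block_feats, block_weights):
        m = block_attention_map(feats, w)
        acc = m if acc is None else acc + m       # dynamic accumulation
        flow.append(acc / (acc.max() + 1e-8))
    return flow
```

In this reading, `block_weights` would come from decomposing the classification score for supervised ViTs, or from the proposed dimension-wise importance weights for self-supervised ones; visualizing `flow[k]` shows the attention formed up to block `k`.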
Problem

Research questions and friction points this paper is trying to address.

Existing visual explanation methods cannot reveal the attention flow hidden inside ViT models.
How to visualize the block-wise formation of decision-making attention regions.
How to interpret both supervised and self-supervised ViT models within a single framework.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Accumulated Attention Map (DAAM) visualizes the attention flow at any intermediate block.
A decomposition module unlocks the [class] token to construct spatial feature information.
Dimension-wise importance weights yield channel importance coefficients for self-supervised ViTs.
Yi Liao
School of Engineering and Built Environment, Griffith University, Australia
Yongsheng Gao
School of Engineering and Built Environment, Griffith University, Australia
Weichuan Zhang
Full Professor, Shaanxi University of Science & Technology
Image Processing, Image Analysis, Pattern Recognition, Computer Vision