Dissecting Query-Key Interaction in Vision Transformers

📅 2024-04-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
The hierarchical role of query-key interactions in Vision Transformers (ViTs) for feature perception in image classification remains poorly understood—particularly the hypothesis that early layers attend to similar features while later layers focus on discriminative ones. Method: We propose an interpretable analytical framework based on singular value decomposition (SVD) of attention weight matrices, enabling layer-wise characterization of semantic interactions across ViT layers. Contribution/Results: Our analysis systematically reveals that lower layers predominantly model intra-object and part-level similarities—supporting perceptual grouping—whereas higher layers emphasize foreground-background separation and inter-class dissimilarity—facilitating contextual understanding. Evaluated across multiple ViT architectures, the method yields semantically coherent, hierarchically structured attention visualizations. It empirically validates a layered synergy between perceptual grouping and contextual modeling, establishing a novel, reproducible paradigm for Transformer interpretability with publicly available tools.

📝 Abstract
Self-attention in vision transformers is often thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings, which could correspond to semantically similar features of an object. However, attending to dissimilar tokens can be beneficial by providing contextual information. We propose to analyze the query-key interaction by the singular value decomposition of the interaction matrix (i.e. $\mathbf{W}_q^\top \mathbf{W}_k$). We find that in many ViTs, especially those with classification training objectives, early layers attend more to similar tokens, while late layers show increased attention to dissimilar tokens, providing evidence corresponding to perceptual grouping and contextualization, respectively. Many of these interactions between features represented by singular vectors are interpretable and semantic, such as attention between relevant objects, between parts of an object, or between the foreground and background. This offers a novel perspective on interpreting the attention mechanism, which contributes to understanding how transformer models utilize context and salient features when processing images.
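The core analysis described in the abstract can be sketched in a few lines: factor the query-key interaction matrix $\mathbf{W}_q^\top \mathbf{W}_k$ with SVD, then check how aligned each left singular vector is with its paired right singular vector. This is a minimal illustration with random matrices standing in for trained weights; the matrix shapes and the alignment score are assumptions for demonstration, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 64, 16

# Hypothetical per-head projection matrices; in a real ViT these would
# come from a trained checkpoint (one attention head of one layer).
W_q = rng.standard_normal((d_head, d_model))
W_k = rng.standard_normal((d_head, d_model))

# Query-key interaction matrix: attention logits between token
# embeddings x_i, x_j are proportional to x_i^T (W_q^T W_k) x_j.
M = W_q.T @ W_k  # shape (d_model, d_model)

# SVD splits the interaction into rank-1 feature pairings:
# M = sum_i s_i * u_i v_i^T, so a token aligned with left singular
# vector u_i attends to tokens aligned with right singular vector v_i.
U, S, Vt = np.linalg.svd(M, full_matrices=False)

# Dot product between each paired pair of unit singular vectors:
# near +1 suggests attention to similar features (perceptual grouping),
# near -1 suggests attention to dissimilar features (contextualization).
alignment = np.einsum("di,di->i", U, Vt.T)

print(alignment[:5])
```

For a trained model, `W_q` and `W_k` would be read from a specific head's projection weights, and the sign of `alignment` for the top singular directions could be compared across layers.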
Problem

Research questions and friction points this paper is trying to address.

Visual Transformer Models
Attention Mechanism Dynamics
Image Classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Singular Value Decomposition
Visual Transformers
Attention Mechanism
Xu Pan
Harvard University
computational neuroscience, deep learning
Aaron Philip
Michigan State University
Ziqian Xie
University of Texas Health Science Center at Houston
Odelia Schwartz
University of Miami