Scratching Visual Transformer's Back with Uniform Attention

📅 2022-10-16
🏛️ IEEE International Conference on Computer Vision
📈 Citations: 16
Influential: 1
🤖 AI Summary
Vision Transformers (ViTs) exhibit an implicit preference for dense spatial interactions in their multi-head self-attention (MSA), even though dense attention maps are harder to learn under the softmax constraint. This work identifies and characterizes that tension. We propose Context Broadcasting (CB), a parameter-free mechanism, deployable with a single line of code, that injects uniform dense attention into each layer so that the learned attention maps no longer need to supply dense interactions themselves. CB adds no parameters and negligible computational overhead; instead, it reduces the density of the learned attention maps. Evaluated on ImageNet and other benchmarks, CB consistently improves the capacity and generalization of ViT models, delivering stable accuracy gains across architectures. The approach offers both an analytical lens on attention density in Transformers and a practical, implementation-friendly tool for improving attention behavior without architectural modification.
📝 Abstract
The favorable performance of Vision Transformers (ViTs) is often attributed to the multi-head self-attention (MSA), which enables global interactions at each layer of a ViT model. Previous works attribute the effectiveness of MSA to its long-range dependency. In this work, we study the role of MSA along a different axis: density. Our preliminary analyses suggest that the spatial interactions of learned attention maps are closer to dense interactions than sparse ones. This is a curious phenomenon because dense attention maps are harder for the model to learn due to softmax. We interpret this behavior, which works against softmax, as a strong preference of ViT models for dense interactions. We thus manually insert dense uniform attention into each layer of the ViT models to supply the much-needed dense interactions. We call this method Context Broadcasting, CB. Our study demonstrates that the inclusion of CB takes over the role of dense attention and thereby reduces the degree of density in the original attention maps while complying with the softmax in MSA. We also show that, with the negligible costs of CB (1 line in your model code and no additional parameters), both the capacity and generalizability of the ViT models are increased.
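The "1 line in your model code" claim can be illustrated with a minimal NumPy sketch of the core idea: add the mean token, which is exactly what a uniform dense attention map would produce, back to every token. The function name is hypothetical, and the paper applies this inside each ViT layer, where the exact placement and any scaling may differ from this sketch.

```python
import numpy as np

def context_broadcast(x: np.ndarray) -> np.ndarray:
    """Broadcast global context to every token.

    x: token embeddings of shape (num_tokens, dim) at one layer.
    The mean over tokens equals the output of a uniform attention map
    (every token attends equally to all tokens); adding it to each token
    supplies the dense interaction explicitly, in one line.
    """
    return x + x.mean(axis=0, keepdims=True)

# Toy usage: two tokens in a 2-D embedding space.
tokens = np.array([[1.0, 3.0],
                   [3.0, 1.0]])
out = context_broadcast(tokens)  # mean token is [2, 2], added to each row
```

In a real ViT block this would typically be the final statement before the residual connection, which is why it costs no parameters and essentially no compute.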
Problem

Research questions and friction points this paper is trying to address.

- Visual Transformers
- Dense Attention Maps
- Model Adaptability
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Context Broadcasting
- Visual Transformer
- Parameter-Efficient