Locality-Attending Vision Transformer

📅 2026-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Although Vision Transformers have demonstrated strong performance in image classification, their global self-attention mechanism lacks explicit modeling of local spatial details, which limits their effectiveness in segmentation tasks. To address this limitation, this work proposes a plug-and-play, learnable Gaussian kernel-modulated self-attention mechanism that refines patch representations after standard classification training, thereby enhancing the model’s focus on local neighborhoods while preserving its global contextual awareness. Notably, the method requires no modification to the original training pipeline and incurs no loss in classification accuracy. Experimental results show consistent and significant improvements across three segmentation benchmarks, including ADE20K, with mIoU gains exceeding 6% for ViT-Tiny and over 4% for ViT-Base.

📝 Abstract
Vision transformers have demonstrated remarkable success in classification by leveraging global self-attention to capture long-range dependencies. However, this same mechanism can obscure fine-grained spatial details crucial for tasks such as segmentation. In this work, we seek to enhance the segmentation performance of vision transformers after standard image-level classification training. More specifically, we present a simple yet effective add-on that improves performance on segmentation tasks while retaining vision transformers' image-level recognition capabilities. In our approach, we modulate the self-attention with a learnable Gaussian kernel that biases the attention toward neighboring patches. We further refine the patch representations to learn better embeddings at patch positions. These modifications encourage tokens to focus on their local surroundings and ensure meaningful representations at spatial positions, while still preserving the model's ability to incorporate global information. Experiments demonstrate the effectiveness of our modifications, evidenced by substantial segmentation gains on three benchmarks (e.g., over 6% and 4% on ADE20K for ViT-Tiny and ViT-Base), without changing the training regime or sacrificing classification performance. The code is available at https://github.com/sinahmr/LocAtViT/.
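The core idea of a Gaussian-modulated self-attention can be sketched as follows. This is an illustrative PyTorch sketch, not the authors' implementation (see their repository for the real code): the attention logits receive an additive bias that decays with the squared 2D distance between patch positions, controlled by a learnable per-head bandwidth. The class name, the per-head `log_sigma` parameterization, and the grid-distance buffer are assumptions made for the sketch.

```python
# Illustrative sketch (not the paper's exact code): self-attention whose
# logits are biased toward neighboring patches by a learnable Gaussian kernel.
import math
import torch
import torch.nn as nn

class GaussianLocalAttention(nn.Module):
    def __init__(self, dim, num_heads, grid_size):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable bandwidth per head; a large sigma flattens the bias,
        # recovering near-global attention, so global context is not lost.
        self.log_sigma = nn.Parameter(torch.zeros(num_heads))
        # Precompute squared Euclidean distances between 2D patch positions.
        ys, xs = torch.meshgrid(
            torch.arange(grid_size), torch.arange(grid_size), indexing="ij"
        )
        pos = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()  # (N, 2)
        self.register_buffer("dist2", torch.cdist(pos, pos) ** 2)       # (N, N)

    def forward(self, x):  # x: (B, N, dim) with N == grid_size ** 2
        B, N, dim = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)
        logits = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        sigma = self.log_sigma.exp().view(1, -1, 1, 1)  # (1, heads, 1, 1)
        # Additive Gaussian bias: distant patches are penalized, neighbors favored.
        logits = logits - self.dist2 / (2 * sigma ** 2)
        attn = logits.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, dim)
        return self.proj(out)
```

Because the bias is purely additive on the logits and `sigma` is learnable, the module can be dropped into a pretrained ViT block without altering input/output shapes, which matches the paper's plug-and-play framing.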
Problem

Research questions and friction points this paper is trying to address.

vision transformer
semantic segmentation
locality
self-attention
spatial details
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Transformer
locality-aware attention
Gaussian-modulated self-attention
semantic segmentation
patch representation refinement