See What You Are Told: Visual Attention Sink in Large Multimodal Models

📅 2025-03-05
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
This paper identifies a “visual attention sink” phenomenon in large multimodal models (LMMs): certain visual tokens consistently receive high attention weights despite being semantically irrelevant to the text, impairing cross-modal alignment. To address this, the authors propose Visual Attention Redistribution (VAR), a training-free, plug-and-play method that recalibrates attention weights through decoder attention analysis, hidden-state activation diagnostics, and identification of image-centric attention heads. Unlike conventional approaches that rely on fine-tuning or architectural modifications, VAR introduces zero trainable parameters and incurs no inference overhead. Extensive evaluation demonstrates significant improvements across general vision-language understanding, visual hallucination suppression, and vision-centric tasks, validating an effective, interpretable pathway for optimizing attention mechanisms in LMMs through post-hoc, analysis-driven weight redistribution.

📝 Abstract
Large multimodal models (LMMs) "see" images by leveraging the attention mechanism between text and visual tokens in the transformer decoder. Ideally, these models should focus on the key visual information relevant to the text tokens. However, recent findings indicate that LMMs have a strong tendency to consistently allocate high attention weights to specific visual tokens, even when these tokens are irrelevant to the corresponding text. In this study, we investigate why these irrelevant visual tokens emerge and examine their characteristics. Our findings show that this behavior arises from the massive activation of certain hidden state dimensions, which resembles the attention sink found in language models. Hence, we refer to this phenomenon as the visual attention sink. In particular, our analysis reveals that removing the irrelevant visual sink tokens does not impact model performance, despite the high attention weights they receive. Consequently, we recycle the attention to these tokens as surplus resources, redistributing the attention budget to enhance focus on the image. To achieve this, we introduce Visual Attention Redistribution (VAR), a method that redistributes attention in image-centric heads, which we identify as innately focusing on visual information. VAR can be seamlessly applied across different LMMs to improve performance on a wide range of tasks, including general vision-language tasks, visual hallucination tasks, and vision-centric tasks, all without the need for additional training, models, or inference steps. Experimental results demonstrate that VAR enables LMMs to process visual information more effectively by adjusting their internal attention mechanisms, offering a new direction for enhancing the multimodal capabilities of LMMs.
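The core recycling step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function name, arguments, and the proportional-redistribution rule are assumptions, and identifying which visual tokens are sinks (via massive hidden-state activations) is taken as given.

```python
import numpy as np

def redistribute_attention(attn, visual_idx, sink_mask, keep=0.0):
    """Sketch of attention recycling: treat attention mass on visual
    "sink" tokens as a surplus budget and redistribute it over the
    remaining visual tokens (all names here are illustrative).

    attn       : (num_tokens,) attention weights of one text query, summing to 1
    visual_idx : (num_visual,) indices of visual tokens in the sequence
    sink_mask  : boolean mask over visual_idx marking sink tokens
    keep       : fraction of sink attention left in place (0 = recycle all)
    """
    attn = attn.copy()
    sink_pos = visual_idx[sink_mask]     # sink visual tokens
    other_pos = visual_idx[~sink_mask]   # non-sink visual tokens
    surplus = attn[sink_pos].sum() * (1.0 - keep)
    attn[sink_pos] *= keep
    # Redistribute the surplus proportionally to the attention already
    # placed on the non-sink visual tokens, so the weights still sum to 1.
    weights = attn[other_pos]
    if weights.sum() > 0:
        attn[other_pos] += surplus * weights / weights.sum()
    return attn
```

Because the surplus is spread proportionally, tokens the model already attends to gain the most, sharpening focus on the relevant image regions rather than flattening the distribution.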
Problem

Research questions and friction points this paper is trying to address.

LMMs consistently allocate high attention to visual tokens that are irrelevant to the text.
This visual attention sink arises from massive activation of certain hidden state dimensions.
Attention wasted on sink tokens is a surplus budget that could be recycled to improve LMM performance.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Attention Redistribution (VAR): a training-free method that recycles attention from sink tokens.
VAR redistributes attention in image-centric heads that innately focus on visual information.
VAR improves LMM performance without additional training, models, or inference steps.
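The "image-centric heads" idea above can be illustrated with a simple selection rule. This is a hedged sketch under an assumed criterion (the function name, the per-head visual-attention-mass score, and the threshold value are illustrative, not the paper's definition):

```python
import numpy as np

def image_centric_heads(attn_maps, visual_idx, threshold=0.5):
    """Flag attention heads that concentrate on visual tokens, so that
    redistribution can be applied only to those heads (illustrative
    criterion; the paper's exact head-selection rule may differ).

    attn_maps : (num_heads, num_queries, num_tokens) attention weights
    visual_idx: indices of visual tokens in the sequence
    """
    # For each head, average over queries the total attention mass
    # placed on visual tokens; keep heads above the threshold.
    visual_mass = attn_maps[:, :, visual_idx].sum(axis=-1).mean(axis=-1)
    return np.where(visual_mass >= threshold)[0]
```

Restricting redistribution to such heads leaves text-centric heads untouched, which is one plausible way a method like VAR could avoid degrading language-side behavior.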