To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
A key bottleneck in large vision-language models (LVLMs) is the inefficient transfer of visual information from Vision Transformers (ViTs) to language models, in particular because the semantic role of high-norm visual tokens, termed "ViT attention sinks," is poorly understood. Method: The paper provides the first systematic characterization of these attention sinks, with empirical validation that they encode high-level semantics and substantially enhance visual reasoning. The approach combines training-free analysis with training-based optimization, leveraging attention tracing, feature attribution, and token-guided routing to explicitly strengthen the ViT-to-LLM information flow. Contribution/Results: The method consistently improves performance across diverse LVLM architectures (e.g., LLaVA, Qwen-VL) and visual reasoning benchmarks (e.g., POPE, MME, MMStar), demonstrating strong cross-model generalizability. It establishes the semantic significance of attention sinks and advances multimodal information routing, filling a critical gap in understanding visual token roles and inter-modal information flow.

📝 Abstract
Large Vision Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). The ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end -- the eyes of the model. The LLM, in turn, interprets these tokens to perform high-level reasoning and generate responses, functioning as the cognitive core -- the brain of the model. However, it remains unclear which visual tokens contribute most significantly to understanding and reasoning, and how effectively these signals are propagated from the ViT to the LLM. While most existing works have focused on identifying attention sinks -- low-semantic tokens receiving disproportionately high attention -- within the LLM, we shift the focus to the vision encoder by identifying a class of high-norm visual tokens from the ViT, referred to as ViT attention sinks -- a problem that has been rarely studied but is critically important for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, allowing the LLM to perform more effective understanding and reasoning. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens. We also propose both training-free and training-based approaches that let the LLM better leverage this information, and we examine how, and to what extent, the LLM interprets it. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.
Problem

Research questions and friction points this paper is trying to address.

Identifying high-norm visual tokens in Vision Transformers
Analyzing propagation effectiveness of visual signals to LLMs
Improving visual reasoning by leveraging ViT attention sinks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies high-norm ViT attention sink tokens
Proposes training-free and training-based enhancement methods
Leverages ViT sinks to improve visual reasoning tasks
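The sink-identification step above hinges on the observation that sink tokens have unusually large activation norms in the ViT's output. The paper does not publish its exact detection rule here, so the following is a minimal illustrative sketch, assuming a simple outlier criterion (L2 norm exceeding the mean norm by `k` standard deviations) over a hypothetical `(num_tokens, dim)` array of ViT token embeddings:

```python
import numpy as np

def find_sink_tokens(tokens: np.ndarray, k: float = 2.0) -> np.ndarray:
    """Return indices of high-norm ('sink') tokens.

    tokens: (num_tokens, dim) array of ViT output embeddings.
    A token is flagged as a sink when its L2 norm exceeds
    mean + k * std of all token norms. The threshold rule and
    the value of k are illustrative assumptions, not the paper's method.
    """
    norms = np.linalg.norm(tokens, axis=-1)   # per-token L2 norm
    threshold = norms.mean() + k * norms.std()
    return np.where(norms > threshold)[0]

# Toy example: eight ordinary tokens plus one with an inflated norm.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(9, 16))
tokens[4] *= 50.0  # artificially inflate one token's norm
print(find_sink_tokens(tokens))  # the inflated token (index 4) is flagged
```

In practice the tokens would come from the final (or a late) ViT layer of the LVLM's vision encoder; the flagged indices could then be routed to the LLM explicitly rather than discarded, which is the spirit of the training-free variant described above.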