Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) suffer substantial performance degradation in complex visual scenes. Method: This paper proposes CARVE (Contrastive Attention Refinement for Visual Enhancement), a training-free, lightweight enhancement method grounded in VLMs' intrinsic attention mechanisms. CARVE identifies a strong correlation between visual complexity and attention entropy, revealing an evolutionary pattern in which attention shifts from global scanning in shallow layers to task-directed focusing in deeper layers. Theoretically, it proves that contrasting the attention maps produced by generic and task-specific queries disentangles semantic signals from visual noise. By performing pixel-level contrastive analysis across these query attention maps, CARVE achieves fine-grained enhancement of task-relevant visual signals. Results: Evaluated on multiple open-source VLMs, CARVE improves inference accuracy by up to 75%, significantly boosting visual understanding and reasoning in complex scenes.
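The reported entropy–complexity link can be illustrated with a small sketch. This is not the paper's implementation; the maps below are synthetic, and the entropy is the ordinary Shannon entropy of a normalized attention distribution.

```python
import numpy as np

def attention_entropy(attn_map):
    """Shannon entropy of a normalized attention map.

    High entropy ~ attention spread across the whole image (global scanning);
    low entropy ~ attention concentrated on a few regions (task-directed focus).
    """
    p = attn_map.flatten().astype(float)
    p = p / p.sum()            # normalize to a probability distribution
    p = p[p > 0]               # drop zeros to avoid log(0)
    return float(-(p * np.log(p)).sum())

# A uniform map (complex, unfocused scene) vs. a sharply peaked map (focused)
uniform = np.ones((8, 8))
peaked = np.zeros((8, 8))
peaked[3, 4] = 1.0

assert attention_entropy(uniform) > attention_entropy(peaked)
```

Under this toy measure, a uniform 8×8 map attains the maximum entropy ln(64) while a single-pixel map scores zero, matching the paper's framing of convergence degree as an entropy drop.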

📝 Abstract
Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments. While existing enhancement approaches require additional training, rely on external segmentation tools, or operate at coarse-grained levels, they overlook the innate ability within VLMs. To bridge this gap, we investigate VLMs' attention patterns and discover that: (1) visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance; (2) attention progressively refines from global scanning in shallow layers to focused convergence in deeper layers, with convergence degree determined by visual complexity; and (3) theoretically, we prove that the contrast of attention maps between general queries and task-specific queries enables the decomposition of visual signal into semantic signal and visual noise components. Building on these insights, we propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level. Extensive experiments demonstrate that CARVE consistently enhances performance, achieving up to 75% improvement on open-source models. Our work provides critical insights into the interplay between visual complexity and attention mechanisms, offering an efficient pathway for improving visual reasoning with contrasting attention.
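One plausible reading of the attention-contrast idea can be sketched as a pixel-wise difference between two normalized attention maps; this is an illustrative sketch under stated assumptions, not CARVE itself. It assumes the general-query map approximates query-independent visual noise, so the clipped residual of the task-query map isolates task-relevant signal.

```python
import numpy as np

def contrastive_refine(task_attn, general_attn, eps=1e-8):
    """Pixel-level attention contrast (illustrative sketch, not the paper's code).

    Both maps are normalized to distributions. The general-query map is
    treated as query-independent visual noise; subtracting it leaves a
    task-specific residual, which is clipped at zero and renormalized
    into a refinement mask.
    """
    t = task_attn / (task_attn.sum() + eps)
    g = general_attn / (general_attn.sum() + eps)
    residual = np.clip(t - g, 0.0, None)   # keep only task-specific excess
    return residual / (residual.sum() + eps)

# Toy maps: the task query attends to one region on top of shared saliency
general = np.full((4, 4), 1.0)
task = np.full((4, 4), 1.0)
task[1, 2] += 4.0   # hypothetical task-relevant region

mask = contrastive_refine(task, general)
# The refined mask should concentrate on the task-specific region
assert mask.argmax() == np.ravel_multi_index((1, 2), (4, 4))
```

The clip-then-renormalize step is one simple way to keep the result a valid attention distribution; the paper's actual decomposition may weight or combine the maps differently.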
Problem

Research questions and friction points this paper is trying to address.

Enhancing VLMs' visual reasoning in complex environments
Addressing performance degradation without additional training
Extracting task-relevant signals through attention contrasting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free attention contrasting method
Pixel-level visual signal extraction
Enhancing VLMs without external tools