VERA: Identifying and Leveraging Visual Evidence Retrieval Heads in Long-Context Understanding

📅 2026-02-09
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language models struggle to effectively leverage visual evidence in long-context and complex reasoning tasks. This work identifies and validates, for the first time, the presence of sparse, dynamic visual evidence retrieval (VER) attention heads within these models, demonstrating their causal role in locating critical visual cues. Building on this insight, we propose VERA, a training-free framework that enhances decision-making by explicitly retrieving relevant visual evidence when model uncertainty is detected via entropy-driven signals. Evaluated across five benchmarks, VERA significantly improves long-context understanding in open-source models, yielding relative average performance gains of 21.3% for Qwen3-VL-8B-Instruct and 20.1% for GLM-4.1V-Thinking.
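The entropy-driven trigger can be pictured with a small sketch. The code below is an illustrative assumption rather than the paper's implementation: it computes the entropy of the next-token distribution at a decoding step and flags the step as uncertain once the entropy crosses a threshold, the point at which VERA would retrieve and verbalize visual evidence. The function names and the threshold value are hypothetical.

```python
# Illustrative sketch only: an entropy-based uncertainty trigger of the kind
# the summary describes. Function names and the 2.0-nat threshold are
# assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F

def next_token_entropy(logits: torch.Tensor) -> float:
    """Shannon entropy (in nats) of the next-token distribution."""
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return float(-(probs * log_probs).sum())

def should_retrieve_evidence(logits: torch.Tensor, threshold: float = 2.0) -> bool:
    """Return True when the decoding step looks uncertain, i.e. when an
    evidence-retrieval step would be triggered."""
    return next_token_entropy(logits) > threshold

# Toy usage with random logits over a 32k-entry vocabulary.
logits = torch.randn(32_000)
if should_retrieve_evidence(logits):
    print("High entropy: verbalize the visual evidence attended by VER heads.")
```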

πŸ“ Abstract
While Vision-Language Models (VLMs) have shown promise in textual understanding, they face significant challenges when handling long-context and complex reasoning tasks. In this paper, we dissect the internal mechanisms governing long-context processing in VLMs to understand their performance bottlenecks. Through the lens of attention analysis, we identify specific Visual Evidence Retrieval (VER) Heads: a sparse, dynamic set of attention heads critical for locating visual cues during reasoning, distinct from static OCR heads. We demonstrate that these heads are causal to model performance; masking them leads to significant degradation. Leveraging this discovery, we propose VERA (Visual Evidence Retrieval Augmentation), a training-free framework that detects model uncertainty via an entropy signal and triggers the explicit verbalization of the visual evidence attended by VER heads. Comprehensive experiments demonstrate that VERA significantly improves the long-context understanding of open-source VLMs: it yields an average relative improvement of 21.3% on Qwen3-VL-8B-Instruct and 20.1% on GLM-4.1V-Thinking across five benchmarks.
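As one way to make the head-identification idea concrete, the sketch below ranks attention heads by the attention mass they place on token positions known to contain visual evidence; heads with consistently high mass would be candidate VER heads. The scoring rule, array shapes, and variable names are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch: rank attention heads by the mean attention mass they
# place on annotated visual-evidence token positions. This is one simple way
# to surface sparse "retrieval-like" heads, not the paper's exact criterion.
import numpy as np

def ver_head_scores(attn: np.ndarray, evidence_idx: np.ndarray) -> np.ndarray:
    """attn: (layers, heads, query_len, key_len) attention weights for one sample.
    Returns a (layers, heads) array of mean attention mass on evidence tokens."""
    mass_on_evidence = attn[..., evidence_idx].sum(axis=-1)  # (L, H, Q)
    return mass_on_evidence.mean(axis=-1)                    # (L, H)

# Toy example: 4 layers x 8 heads, 16 query and 128 key positions.
rng = np.random.default_rng(0)
attn = rng.random((4, 8, 16, 128))
attn /= attn.sum(axis=-1, keepdims=True)                # rows sum to 1
scores = ver_head_scores(attn, np.array([40, 41, 42]))  # evidence positions
flat = np.argsort(scores, axis=None)[::-1][:5]          # top-5 flattened indices
top = list(zip(*np.unravel_index(flat, scores.shape)))
print("Top candidate (layer, head) pairs:", top)
```

In practice one would aggregate such scores over many annotated samples and keep only the small set of heads that score highly consistently, in line with the paper's observation that VER heads are sparse.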
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
long-context understanding
visual evidence retrieval
complex reasoning
attention mechanisms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Evidence Retrieval Heads
attention analysis
long-context understanding
training-free augmentation
vision-language models
🔎 Similar Papers
No similar papers found.