🤖 AI Summary
This work identifies the root cause of vision-language models' (VLMs') incorrect answers despite the presence of decisive visual evidence: not a failure to perceive the relevant regions, but a decoupling between visual attention and reasoning, termed "seeing but not believing." We present the first empirical evidence that VLMs' visual attention distributions correlate only weakly with answer correctness. To address this, we propose a training-free, inference-time hierarchical attention intervention that selectively amplifies attention weights over evidence regions in deeper network layers, explicitly bridging perception and reasoning. Evaluated on leading VLMs, including LLaVA, Qwen-VL, Gemma, and InternVL, our approach delivers consistent and substantial accuracy improvements across diverse architectures, validating both its effectiveness and its cross-architecture generalizability. The method provides a novel diagnostic lens into multimodal alignment mechanisms in VLMs and a practical, lightweight tool for enhancing their visual grounding and reasoning fidelity.
📝 Abstract
Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. In this work, we systematically investigate whether these failures arise from not perceiving the evidence or from not leveraging it effectively. By examining layer-wise attention dynamics, we find that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. Surprisingly, VLMs often perceive the visual evidence even when outputting incorrect answers, a phenomenon we term "seeing but not believing" that is widespread across major VLM families. Building on this, we introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking. It requires no training and consistently improves accuracy across multiple families, including LLaVA, Qwen, Gemma, and InternVL. These results show that VLMs encode reliable evidence internally but under-utilize it; making such signals explicit can bridge the gap between perception and reasoning, advancing both the diagnostic understanding and the reliability of VLMs.
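The deep-layer intervention can be sketched as a simple reweighting of each attention row. The sketch below is illustrative, not the paper's exact formulation: the amplification factor `alpha`, the depth threshold `depth_frac`, and the assumption that evidence-region token positions are already known (e.g. supplied by a localizer) are all assumptions introduced here.

```python
import numpy as np

def amplify_evidence_attention(attn, evidence_mask, layer_idx,
                               num_layers, alpha=2.0, depth_frac=0.5):
    """Upweight attention on evidence-region tokens in deep layers.

    attn:          (num_heads, q_len, k_len) softmaxed attention weights
    evidence_mask: (k_len,) boolean, True at image-token positions that
                   fall inside the evidence region (assumed given)
    alpha:         hypothetical amplification factor
    """
    # Intervene only in the deeper layers, where attention is reported
    # to localize on evidence regions; pass shallow layers through.
    if layer_idx < int(num_layers * depth_frac):
        return attn
    boosted = attn * np.where(evidence_mask, alpha, 1.0)
    # Renormalize so each query's weights again sum to 1.
    return boosted / boosted.sum(axis=-1, keepdims=True)
```

Because the operation only rescales existing weights and renormalizes, it needs no gradient updates and can be applied at inference time to any transformer-based VLM that exposes its attention maps.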