Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study asks whether vision-language models (VLMs) genuinely re-examine the visual input when they claim to “look back at the image” during reasoning. The authors propose VisualSwap, a probing framework that, after the model has reasoned over an image, replaces that image with a visually similar but semantically different one and tests whether the model notices the change. The results indicate that VLMs’ self-reflective statements often lack visual grounding: thinking models are nearly 3x more vulnerable to the swap than their instructed counterparts, and scaling model size does not mitigate the problem. On the newly introduced VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro, the evaluated Qwen3-VL, Kimi-VL, and ERNIE-VL models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Multi-turn user instructions restore visual grounding, whereas self-generated reflective statements during continuous generation do not; attention analysis attributes this to user instructions substantially raising attention to visual tokens while self-reflection does not.

📝 Abstract

Vision-Language Models (VLMs) often produce self-reflective statements like "let me check the figure again" during reasoning. Do such statements trigger genuine visual re-examination, or are they merely learned textual patterns? We investigate this via VisualSwap, an image-swap probing framework: after a model reasons over an image, we replace it with a visually similar but semantically different one and test whether the model notices. We introduce VS-Bench, 800 image pairs curated from MathVista, MathVerse, MathVision, and MMMU-Pro. Experiments on Qwen3-VL, Kimi-VL, and ERNIE-VL reveal a striking failure: models overwhelmingly miss the swap, with accuracy dropping by up to 60%. Counterintuitively, thinking models are nearly 3x more vulnerable than their instructed counterparts, and scaling offers no mitigation. Multi-turn user instructions restore visual grounding, but self-generated reflective statements during continuous generation do not. Attention analysis explains why: user instructions substantially elevate attention to visual tokens, whereas self-reflection does not. Current VLMs tend to say rather than actually see when claiming to perform visual re-examination. Our code and dataset are available at the project page: https://visualswap.github.io
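
The image-swap probe described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `generate(image, question, prefix)` interface, the `SwapPair` fields, the `Answer:` output convention, and the two toy models are all hypothetical, and the exact point at which the paper swaps the image is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical model interface (an assumption, not a real library API):
# (image_id, question, prefix_text) -> generated continuation.
GenerateFn = Callable[[str, str, str], str]


@dataclass
class SwapPair:
    original_image: str   # image the model reasons over first
    swapped_image: str    # visually similar, semantically different image
    question: str
    original_answer: str  # correct answer for the original image
    swapped_answer: str   # correct answer for the swapped image


def final_answer(text: str) -> str:
    """Return the first token after the last 'Answer:' marker (assumed format)."""
    tokens = text.rsplit("Answer:", 1)[-1].split()
    return tokens[0] if tokens else ""


def swap_probe(generate: GenerateFn, pair: SwapPair,
               reflection: str = "Let me check the figure again.") -> Dict[str, bool]:
    # Phase 1: the model reasons over the original image.
    first_pass = generate(pair.original_image, pair.question, "")
    # Phase 2: keep the reasoning so far plus a reflective statement,
    # swap the image, and let the model continue generating.
    prefix = first_pass + " " + reflection
    continuation = generate(pair.swapped_image, pair.question, prefix)
    answer = final_answer(continuation)
    return {
        "noticed_swap": answer == pair.swapped_answer,
        "stuck_on_original": answer == pair.original_answer,
    }


def swap_accuracy(generate: GenerateFn, pairs: List[SwapPair]) -> float:
    """Fraction of pairs where the post-swap answer matches the swapped image."""
    results = [swap_probe(generate, p) for p in pairs]
    return sum(r["noticed_swap"] for r in results) / len(results)


# Toy stand-ins for image content, used only to exercise the probe.
TOY_CONTENT = {"bar_chart_a.png": "4", "bar_chart_b.png": "7"}


def blind_model(image: str, question: str, prefix: str) -> str:
    # Once it has prior text, this toy model repeats its earlier answer
    # and never consults the (now swapped) image.
    if prefix:
        return "Looking again, the result holds. Answer: " + final_answer(prefix)
    return "Answer: " + TOY_CONTENT[image]


def grounded_model(image: str, question: str, prefix: str) -> str:
    # This toy model always reads the image it is currently given.
    return "Answer: " + TOY_CONTENT[image]


pairs = [SwapPair("bar_chart_a.png", "bar_chart_b.png",
                  "How many bars exceed 10?", "4", "7")]

print(swap_accuracy(blind_model, pairs))
print(swap_accuracy(grounded_model, pairs))
```

The blind toy model mimics the failure mode the abstract reports: it produces text consistent with re-examination while its answer still reflects the original image. A real evaluation would replace the toy callables with an actual VLM wrapper and a more robust answer extractor.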

Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models

visual re-examination

image-swap probing

visual grounding

self-reflective statements

Innovation

Methods, ideas, or system contributions that make the work stand out.

VisualSwap

Vision-Language Models

visual re-examination