🤖 AI Summary
This work identifies a critical gap in state-of-the-art vision-language models (VLMs): an inability to perform non-local visual reasoning, i.e., integrating information from spatially distant image regions to solve multi-step visual tasks. To evaluate this capability systematically, the authors introduce three structured benchmark tasks: comparative perception (cross-image comparison), saccadic search (evidence-guided discrete localization), and smooth visual search (contour tracing), each designed to be intuitive for humans yet challenging for models. Experiments show that flagship VLMs, including Gemini 2.5 Pro, Claude Vision 3.7, and GPT-o4-mini, barely exceed chance accuracy on task variants that humans find trivial, substantially underperforming human observers. The study provides a systematic definition, formalization, and empirical evaluation of non-local visual reasoning in VLMs, establishing an interpretable, reproducible evaluation paradigm that advances the understanding of visual cognition in multimodal foundation models.
📝 Abstract
Visual Language Models (VLMs) excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple perceptual tests. We present an evaluation that tests vision-language models' capacity for non-local visual reasoning: reasoning that requires chaining evidence collected from multiple, possibly distant, regions of an image. We isolate three distinct forms of non-local vision: comparative perception, which demands holding two images in working memory and comparing them; saccadic search, which requires making discrete, evidence-driven jumps to locate successive targets; and smooth visual search, which involves tracing along a continuous contour. Flagship models (e.g., Gemini 2.5 Pro, Claude Vision 3.7, GPT-o4-mini), even those that perform well on prior primitive-vision benchmarks, fail these tests and barely exceed random accuracy on two variants of our tasks that are trivial for humans. Our structured evaluation suite allows us to test whether VLMs can perform visual algorithms similar to those humans use. Our findings show that, despite gains in raw visual acuity, current models lack core visual reasoning capabilities.
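To make the evaluation setup concrete, the sketch below mocks up one of the three task families described above, smooth visual search (contour tracing), as a two-alternative forced-choice item and scores a random-guessing baseline against the chance level that the paper reports models barely exceed. This is a hypothetical illustration of the task structure, not the authors' actual stimulus-generation or scoring code; all function names and parameters here are assumptions.

```python
import random


def make_contour_item(n_segments=8, seed=0):
    """Toy contour-tracing item (hypothetical, not the paper's stimuli).

    Two polylines start near the same point and end at distinct labeled
    endpoints; the question is which endpoint the highlighted ("traced")
    contour reaches. Answering requires following the curve continuously,
    i.e., integrating evidence along spatially extended image regions.
    """
    rng = random.Random(seed)

    def random_walk(start):
        # Build a rightward-drifting polyline with small vertical jitter.
        pts = [start]
        for _ in range(n_segments):
            x, y = pts[-1]
            pts.append((x + rng.uniform(0.5, 1.5), y + rng.uniform(-1.0, 1.0)))
        return pts

    return {
        "contours": {"A": random_walk((0.0, 0.0)), "B": random_walk((0.0, 0.5))},
        "traced": rng.choice(["A", "B"]),  # ground-truth answer
    }


def score(predictions, items):
    """Fraction of items where the predicted endpoint label is correct."""
    correct = sum(p == item["traced"] for p, item in zip(predictions, items))
    return correct / len(items)


items = [make_contour_item(seed=s) for s in range(100)]
# A model that cannot trace the contour is reduced to guessing between
# the two endpoints, so its accuracy hovers near the 0.5 chance level.
guesses = [random.Random(1000 + s).choice(["A", "B"]) for s in range(100)]
baseline = score(guesses, items)
```

A real harness would render each item to an image and query a VLM for the endpoint label; the point of the structured generator is that difficulty (e.g., `n_segments`, contour proximity) can be varied while the chance baseline stays fixed, which is what makes near-chance model accuracy interpretable.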