🤖 AI Summary
The fundamental visual understanding capabilities of existing vision-language models (VLMs) remain poorly characterized in highly dense scenes—such as public-domain artworks depicting multiple characters, concurrent actions, and complex backgrounds.
Method: We introduce VisualOverload, the first VQA benchmark explicitly designed to evaluate fine-grained visual understanding in dense scenes. It comprises 2,720 expert-annotated question-answer pairs spanning six knowledge-free task categories: object recognition, counting, OCR, spatial relations, action logic, and consistency reasoning. Built on high-resolution scans of public-domain paintings, it includes graded difficulty splits and is accompanied by a systematic error analysis.
Results: Across 37 state-of-the-art VLMs, even the best model (o3) achieves only 69.5% accuracy overall and 19.6% on the hardest split, revealing systemic failures in fine-detail encoding, counting, OCR, and logical consistency under visual clutter.
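The headline numbers are accuracies aggregated per difficulty split and task category. As a rough illustration of how such grouped scores can be computed, here is a minimal sketch; the record fields (`question_id`, `category`, `answer`) and the answer normalization are assumptions made for illustration, not the benchmark's actual schema or official scoring code.

```python
from collections import defaultdict

def grouped_accuracy(records, predictions, key="category"):
    """Exact-match accuracy per group (e.g., per task category or difficulty split).

    records: list of dicts with hypothetical fields question_id, answer, and a grouping field.
    predictions: dict mapping question_id -> predicted answer string.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        group = rec[key]
        total[group] += 1
        pred = predictions.get(rec["question_id"], "")
        if pred.strip().lower() == rec["answer"].strip().lower():
            correct[group] += 1
    return {g: correct[g] / total[g] for g in total}

# Purely illustrative records -- not real benchmark data.
records = [
    {"question_id": "q1", "category": "counting", "answer": "7"},
    {"question_id": "q2", "category": "ocr", "answer": "anno 1565"},
]
predictions = {"q1": "7", "q2": "Anno 1565"}
print(grouped_accuracy(records, predictions))  # {'counting': 1.0, 'ocr': 1.0}
```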
📝 Abstract
Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs, with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near-global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs and that encoding and reasoning over details remain challenging for them, especially when they are confronted with densely populated scenes. Indeed, we observe that even the best of the 37 tested models (o3) achieves only 19.6% accuracy on our hardest test split and 69.5% accuracy across all questions. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failures in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models.
Benchmark: http://paulgavrikov.github.io/visualoverload
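Because the ground-truth responses are held privately, evaluation presumably works by submitting predictions rather than scoring locally. Below is a minimal sketch of producing a predictions file, assuming the questions are distributed as a Hugging Face dataset; the dataset ID, split name, and field names are assumptions rather than the official interface, so check the benchmark page for the actual submission format.

```python
import json
from datasets import load_dataset

def answer_question(image, question):
    """Placeholder: plug in an actual VLM call here."""
    return "unknown"

# Hypothetical dataset ID, split, and field names -- see the benchmark page for the real ones.
ds = load_dataset("paulgavrikov/visualoverload", split="test")

predictions = [
    {"question_id": ex["question_id"], "answer": answer_question(ex["image"], ex["question"])}
    for ex in ds
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```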