Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) suffer from the “binding problem”—a fundamental limitation in accurately associating low-level visual features with their corresponding objects—thereby impairing performance on counting, visual search, scene description, and spatial relation understanding. To address this, we propose a structure-guided binding enhancement method: explicitly injecting low-level visual structural priors (e.g., horizontal lines) into the input and designing serialized, spatially aware textual prompts to jointly activate the model’s spatial modeling and sequential attention mechanisms. This approach transcends conventional text-only prompting by highlighting the critical role of structured visual input in grounding reasoning and feature alignment. Evaluations on GPT-4o demonstrate substantial improvements: +25.0% accuracy in visual search, +26.83% in counting, −0.32 reduction in edit distance for scene description, and +9.50% gain in spatial relation understanding.
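The structure-guided augmentation described above can be illustrated with a minimal sketch: overlay evenly spaced horizontal lines on the input image, then pair it with a prompt that asks the model to scan band by band. The function name, line count, color, and prompt wording below are illustrative assumptions, not the paper's exact settings.

```python
# Hypothetical sketch of the visual-structure augmentation: overlay evenly
# spaced horizontal lines on an image before sending it to a VLM, and pair
# it with a prompt encouraging serial, band-by-band parsing.
from PIL import Image, ImageDraw

def add_horizontal_lines(image: Image.Image, n_lines: int = 4,
                         color: str = "red", width: int = 2) -> Image.Image:
    """Return a copy of `image` with n_lines evenly spaced horizontal lines."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    h, w = out.height, out.width
    for i in range(1, n_lines + 1):
        y = round(i * h / (n_lines + 1))  # evenly spaced interior lines
        draw.line([(0, y), (w, y)], fill=color, width=width)
    return out

# Illustrative spatially aware prompt (wording is an assumption):
PROMPT = ("The image is divided into horizontal bands by red lines. "
          "Scan the bands from top to bottom, one at a time, and count the "
          "objects in each band before reporting the total.")
```

The augmented image and `PROMPT` would then be sent together in a single query, matching the paper's single-inference setting.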

📝 Abstract
Despite progress in Vision-Language Models (VLMs), their capacity for visual reasoning is often limited by the “binding problem”: the failure to reliably associate perceptual features with their correct visual referents. This limitation underlies persistent errors in tasks such as counting, visual search, scene description, and spatial relationship understanding. A key factor is that current VLMs process visual features largely in parallel, lacking mechanisms for spatially grounded, serial attention. This paper introduces a simple yet effective intervention: augmenting visual inputs with low-level spatial structures (e.g., horizontal lines) and pairing this with a textual prompt that encourages sequential, spatially aware parsing. We empirically demonstrate substantial performance improvements across core visual reasoning tasks. Specifically, our method improves GPT-4o visual search accuracy by 25.00%, increases counting accuracy by 26.83%, reduces edit distance error in scene description by 0.32, and enhances performance on spatial relationship tasks by 9.50% on a 2D synthetic dataset. Furthermore, we find that the visual modification is essential for these gains; purely textual strategies, including Chain-of-Thought prompting, are insufficient and can even degrade performance. Our method enhances binding using only a single inference query, underscoring the importance of visual input design over purely linguistic approaches. These findings suggest that low-level visual structuring is a powerful and underexplored direction for improving compositional visual reasoning and could serve as a general strategy for enhancing VLM performance on spatially grounded tasks.
Problem

Research questions and friction points this paper is trying to address.

VLMs fail to associate features with correct visual referents
Current VLMs lack spatially grounded serial attention mechanisms
Improving visual reasoning in VLMs with low-level spatial structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augmenting visual inputs with spatial structures
Encouraging sequential spatially-aware parsing
Improving binding with single-query inference
Amirmohammad Izadi
Department of Computer Engineering, Sharif University of Technology
Mohammad Ali Banayeeanzade
Bachelor student of Computer Engineering, Sharif University of Technology
Generative models, Vision Language Models, Large Language Models, Compositional Generation
Fatemeh Askari
Department of Computer Engineering, Sharif University of Technology
Ali Rahimiakbar
Department of Computer Engineering, Sharif University of Technology
Mohammad Mahdi Vahedi
Department of Computer Engineering, Sharif University of Technology
Hosein Hasani
Sharif University of Technology
Machine Learning
Mahdieh Soleymani Baghshah
Associate Professor, Computer Engineering Department, Sharif University of Technology
Deep Learning, Machine Learning