🤖 AI Summary
Vision-language models (VLMs) suffer from the “binding problem”—a fundamental limitation in accurately associating low-level visual features with their corresponding objects—thereby impairing performance on counting, visual search, scene description, and spatial relation understanding. To address this, we propose a structure-guided binding enhancement method: explicitly injecting low-level visual structural priors (e.g., horizontal lines) into the input and designing serialized, spatially aware textual prompts to jointly activate the model’s spatial modeling and sequential attention mechanisms. This approach transcends conventional text-only prompting by highlighting the critical role of structured visual input in grounding reasoning and feature alignment. Evaluations on GPT-4o demonstrate substantial improvements: +25.0% accuracy in visual search, +26.83% in counting, −0.32 reduction in edit distance for scene description, and +9.50% gain in spatial relation understanding.
📝 Abstract
Despite progress in Vision-Language Models (VLMs), their capacity for visual reasoning is often limited by the *binding problem*: the failure to reliably associate perceptual features with their correct visual referents. This limitation underlies persistent errors in tasks such as counting, visual search, scene description, and spatial relationship understanding. A key factor is that current VLMs process visual features largely in parallel, lacking mechanisms for spatially grounded, serial attention. This paper introduces a simple yet effective intervention: augmenting visual inputs with low-level spatial structures (e.g., horizontal lines) and pairing this with a textual prompt that encourages sequential, spatially aware parsing. We empirically demonstrate substantial performance improvements across core visual reasoning tasks. Specifically, on a 2D synthetic dataset, our method improves GPT-4o visual search accuracy by 25.00%, increases counting accuracy by 26.83%, reduces edit distance error in scene description by 0.32, and enhances performance on spatial relationship tasks by 9.50%. Furthermore, we find that the visual modification is essential for these gains; purely textual strategies, including Chain-of-Thought prompting, are insufficient and can even degrade performance. Our method enhances binding with only a single inference query, underscoring the importance of visual input design over purely linguistic approaches. These findings suggest that low-level visual structuring is a powerful and underexplored direction for improving compositional visual reasoning and could serve as a general strategy for enhancing VLM performance on spatially grounded tasks.
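The intervention described in the abstract, overlaying horizontal guide lines on the image and pairing them with a prompt that encourages band-by-band scanning, can be sketched roughly as follows. This is a minimal illustration using Pillow, not the authors' released code; the line count, color, and prompt wording are assumptions for the sake of the example.

```python
from PIL import Image, ImageDraw


def add_horizontal_lines(img, n_lines=4, color=(255, 0, 0), width=2):
    """Overlay evenly spaced horizontal lines as a low-level spatial structure.

    Returns a modified copy; the original image is left untouched.
    """
    out = img.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    for i in range(1, n_lines + 1):
        y = round(i * h / (n_lines + 1))  # evenly spaced band boundaries
        draw.line([(0, y), (w, y)], fill=color, width=width)
    return out


# Hypothetical companion prompt encouraging sequential, spatially aware parsing.
SERIAL_SCAN_PROMPT = (
    "The image is divided into horizontal bands by red lines. "
    "Scan each band in order from top to bottom, reading left to right, "
    "and list the objects in each band before giving your final answer."
)
```

The augmented image and `SERIAL_SCAN_PROMPT` would then be sent together in a single query to the VLM, consistent with the paper's single-query inference setup.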