🤖 AI Summary
State-of-the-art vision-language models (VLMs) fail systematically on basic multi-object reasoning tasks, such as counting, localization, and visual analogy, and perform markedly below human capability on them. Method: Drawing on cognitive science, this work formally introduces the “binding problem” into VLM analysis, establishing correspondences between VLM failure patterns and known limitations of feedforward visual processing in humans. Using a theory-driven attribution framework that integrates cognitive modeling, behavioral experiment analysis, and large-scale VLM benchmarking, it moves beyond purely data-driven optimization to explain counterintuitive failures in elementary visual reasoning. Contribution/Results: The analysis reveals fundamental representational bottlenecks in current VLMs, providing theoretical foundations and architectural design principles for next-generation models capable of entity-decoupled, compositional visual understanding.
📝 Abstract
Recent work has documented striking heterogeneity in the performance of state-of-the-art vision-language models (VLMs), including both multimodal language models and text-to-image models. These models can describe and generate a diverse array of complex, naturalistic images, yet they exhibit surprising failures on basic multi-object reasoning tasks (such as counting, localization, and simple forms of visual analogy) that humans perform with near-perfect accuracy. To better understand this puzzling pattern of successes and failures, we turn to theoretical accounts of the binding problem in cognitive science and neuroscience. The binding problem is a fundamental problem that arises when a shared set of representational resources must be used to represent distinct entities (e.g., multiple objects in an image), necessitating serial processing to avoid interference. We find that many of the puzzling failures of state-of-the-art VLMs can be explained as arising from the binding problem, and that these failure modes are strikingly similar to the limitations exhibited by rapid, feedforward processing in the human brain.