🤖 AI Summary
This study investigates the capabilities and limitations of multimodal generative models (e.g., DALL-E 3, GPT-4V) in compositional scene understanding, specifically scenes involving more than five objects and multiple spatial or semantic relations, and quantifies their performance gap relative to human cognition. To this end, we introduce a standardized benchmark for compositional visual reasoning, comprising a structured test set generated via controllable synthetic prompting, a unified cross-model evaluation protocol, and a human–model comparative experimental paradigm. Results show that while current models substantially outperform prior generations on simple compositional tasks, their accuracy drops sharply on complex scenes, averaging 42.6% below human performance, revealing a fundamental bottleneck in structured visual reasoning. Our core contribution is a comprehensive evaluation framework for compositional scene understanding with an explicit human baseline, which clearly exposes the gap between state-of-the-art models and humans in modeling spatial relations.
📝 Abstract
The visual world is fundamentally compositional: visual scenes are defined by the composition of objects and their relations. Hence, it is essential for computer vision systems to reflect and exploit this compositionality to achieve robust and generalizable scene understanding. While major strides have been made toward the development of general-purpose, multimodal generative models, including both text-to-image models and multimodal vision-language models, it remains unclear whether these systems can accurately generate and interpret scenes involving the composition of multiple objects and relations. In this work, we present an evaluation of the compositional visual processing capabilities of the current generation of text-to-image models (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, Claude 3.5 Sonnet, Qwen2-VL-72B, and InternVL2.5-38B), and compare the performance of these systems to human participants. The results suggest that these systems display some ability to solve compositional and relational tasks, showing notable improvements over the previous generation of multimodal models, but their performance nevertheless remains well below that of human participants, particularly for more complex scenes involving many ($>5$) objects and multiple relations. These results highlight the need for further progress toward compositional understanding of visual scenes.