🤖 AI Summary
This study investigates the capabilities and limitations of multimodal generative models (e.g., DALL-E 3, GPT-4V) in compositional scene understanding, specifically scenes involving more than five objects and multiple spatial or semantic relations, and quantifies their performance gap relative to human cognition. To this end, we introduce a standardized benchmark for compositional visual reasoning, comprising a structured test set generated via controllable synthetic prompting, a unified cross-model evaluation protocol, and a human–model comparative experimental paradigm. Results show that while current models substantially outperform prior generations on simple compositional tasks, their accuracy drops sharply on complex scenes, averaging 42.6% below human performance, revealing a fundamental bottleneck in structured visual reasoning. Our core contribution is a comprehensive evaluation framework for compositional scene understanding with an explicit human baseline, which clearly exposes the gap between state-of-the-art models and humans in modeling spatial relations.
📝 Abstract
The visual world is fundamentally compositional: visual scenes are defined by the composition of objects and their relations. Hence, it is essential for computer vision systems to reflect and exploit this compositionality to achieve robust and generalizable scene understanding. While major strides have been made toward the development of general-purpose, multimodal generative models, including both text-to-image models and multimodal vision-language models, it remains unclear whether these systems can accurately generate and interpret scenes involving the composition of multiple objects and relations. In this work, we present an evaluation of the compositional visual processing capabilities of the current generation of text-to-image models (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, Claude 3.5 Sonnet, Qwen2-VL-72B, and InternVL2.5-38B), and compare the performance of these systems to human participants. The results suggest that these systems display some ability to solve compositional and relational tasks, showing notable improvements over the previous generation of multimodal models, but their performance nevertheless remains well below that of human participants, particularly for more complex scenes involving many ($>5$) objects and multiple relations. These results highlight the need for further progress toward compositional understanding of visual scenes.