Evaluating Compositional Scene Understanding in Multimodal Generative Models

📅 2025-03-29
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates the capabilities and limitations of multimodal generative models (e.g., DALL-E 3, GPT-4V) in compositional scene understanding, specifically scenes involving more than five objects and multiple spatial or semantic relations, and quantifies their performance gap relative to human participants. To this end, the authors introduce a standardized benchmark for compositional visual reasoning, comprising a structured test set generated via controllable synthetic prompting, a unified cross-model evaluation protocol, and a human–model comparative experimental paradigm. Results show that while current models significantly outperform the previous generation on simple compositional tasks, their accuracy drops sharply on complex scenes, averaging 42.6% below human performance, revealing a fundamental bottleneck in structured visual reasoning. The core contribution is a comprehensive evaluation framework for compositional scene understanding with an explicit human baseline, which exposes the substantial gap between state-of-the-art models and humans in modeling symbolic spatial relations.
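The page does not spell out how the "controllable synthetic prompting" works. As a rough, hypothetical sketch, generation of scene prompts with a controllable number of objects and relations could look like the following; the object and relation vocabularies, function name, and trial counts are all illustrative assumptions, not the paper's actual stimuli.

```python
import itertools
import random

# Hypothetical vocabularies; the paper's actual object and relation sets
# are not listed on this page.
OBJECTS = ["cube", "sphere", "cone", "cylinder", "pyramid", "torus"]
RELATIONS = ["to the left of", "above", "behind", "inside", "next to"]

def make_scene_prompt(n_objects, n_relations, rng):
    """Sample one scene description with a controllable number of objects
    and pairwise spatial relations, keeping the ground truth for scoring."""
    objs = rng.sample(OBJECTS, n_objects)
    pairs = rng.sample(list(itertools.combinations(objs, 2)), n_relations)
    clauses = [f"a {a} {rng.choice(RELATIONS)} a {b}" for a, b in pairs]
    return {"prompt": "A scene with " + " and ".join(clauses) + ".",
            "objects": objs,
            "relations": pairs}  # ground truth for later scoring

rng = random.Random(0)
simple_trials = [make_scene_prompt(2, 1, rng) for _ in range(100)]   # easy regime
complex_trials = [make_scene_prompt(6, 3, rng) for _ in range(100)]  # >5 objects
```

Sweeping `n_objects` and `n_relations` is what makes the difficulty axis described in the summary (simple vs. complex scenes) explicit and reproducible.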

📝 Abstract
The visual world is fundamentally compositional. Visual scenes are defined by the composition of objects and their relations. Hence, it is essential for computer vision systems to reflect and exploit this compositionality to achieve robust and generalizable scene understanding. While major strides have been made toward the development of general-purpose, multimodal generative models, including both text-to-image models and multimodal vision-language models, it remains unclear whether these systems are capable of accurately generating and interpreting scenes involving the composition of multiple objects and relations. In this work, we present an evaluation of the compositional visual processing capabilities in the current generation of text-to-image (DALL-E 3) and multimodal vision-language models (GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B, and InternVL2.5-38B), and compare the performance of these systems to human participants. The results suggest that these systems display some ability to solve compositional and relational tasks, showing notable improvements over the previous generation of multimodal models, but with performance nevertheless well below the level of human participants, particularly for more complex scenes involving many (>5) objects and multiple relations. These results highlight the need for further progress toward compositional understanding of visual scenes.
Problem

Research questions and friction points this paper is trying to address.

Evaluating compositional scene understanding in multimodal generative models
Assessing capability to generate and interpret multi-object relational scenes
Comparing model performance to humans on complex compositional tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates DALL-E 3 on compositional text-to-image generation
Tests GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B, and InternVL2.5-38B on scene interpretation
Compares all models against human participants on the same compositional tasks (a minimal scoring sketch follows this list)
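Neither the unified evaluation protocol nor the human-baseline comparison is detailed on this page; the sketch below shows one plausible shape for it. The `ModelFn` interface, the `human_model_gap` helper, and the trial schema are illustrative assumptions, not the authors' code.

```python
from statistics import mean
from typing import Callable, Dict, List

# A model is anything that maps (stimulus, question) -> answer string.
# Real backends (GPT-4V, GPT-4o, Claude Sonnet 3.5, QWEN2-VL-72B,
# InternVL2.5-38B) would each be wrapped behind this one signature.
ModelFn = Callable[[str, str], str]

def accuracy(model: ModelFn, trials: List[dict]) -> float:
    """Fraction of trials where the model's answer matches ground truth."""
    return mean(model(t["stimulus"], t["question"]) == t["answer"] for t in trials)

def human_model_gap(models: Dict[str, ModelFn],
                    trials: List[dict],
                    human_acc: float) -> Dict[str, float]:
    """Score every model on the same trials and report the gap to the
    human baseline in percentage points (positive = humans better)."""
    return {name: 100.0 * (human_acc - accuracy(fn, trials))
            for name, fn in models.items()}
```

Running this separately on simple and complex trial sets (as in the prompt-generation sketch above) would reproduce the paper's key contrast: per-model gaps that widen as object and relation counts grow.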
Shuhao Fu
Department of Psychology, University of California, Los Angeles
Andrew Jun Lee
Department of Psychology, University of California, Los Angeles
Anna Wang
Department of Psychology, University of California, Los Angeles
Ida Momennejad
Microsoft Research
Reinforcement Learning · Memory and Planning · multi-agent learning · hippocampus · Prefrontal Cortex
Trevor Bihl
Ohio University
language models · Military Operations Research · cyber security · analogical reasoning · neuromorphics
Hongjing Lu
University of California, Los Angeles (UCLA)
Psychology · Vision · Cognitive psychology · Perception · Computational cognition
Taylor W. Webb
Microsoft Research, NYC