GenSpace: Benchmarking Spatially-Aware Image Generation

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether state-of-the-art image generation models possess human-like 3D spatial perception. To address a key limitation of existing evaluation methods, which fail to capture fine-grained spatial errors, the authors introduce the first benchmark specifically designed to assess 3D spatial awareness in generative models. The evaluation pipeline reconstructs 3D scene geometry collaboratively from multiple vision foundation models and derives a human-aligned spatial fidelity metric built on three key components: multi-view geometric analysis, egocentric-to-allocentric coordinate transformation modeling, and metric consistency verification. Experimental results demonstrate systematic deficits across three core dimensions (object localization, spatial relational reasoning, and metric accuracy), with all current state-of-the-art models scoring significantly below human baselines. This study is the first to systematically identify, characterize, and quantify these three fundamental bottlenecks in generative models' 3D spatial understanding.
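The scoring idea described above (reconstruct the scene in 3D, then verify prompt-derived spatial constraints) can be illustrated with a minimal sketch. This is not the paper's implementation: the object coordinates stand in for reconstructions from vision foundation models, and the names `spatial_fidelity`, `left_of`, and the tolerance values are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Obj:
    """A reconstructed object in camera-frame coordinates (meters)."""
    name: str
    x: float
    y: float
    z: float

def left_of(a: Obj, b: Obj) -> bool:
    # Egocentric relation: a appears to the left of b from the camera viewpoint.
    return a.x < b.x

def distance(a: Obj, b: Obj) -> float:
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))

def spatial_fidelity(scene: dict, checks: list) -> float:
    """Fraction of prompt-derived spatial checks the reconstructed scene
    satisfies; a stand-in for a human-aligned fidelity metric."""
    passed = sum(1 for check in checks if check(scene))
    return passed / len(checks)

# Hypothetical reconstruction of a generated image for the prompt
# "a cup 0.8 m to the left of a lamp".
scene = {o.name: o for o in [Obj("cup", -0.3, 0.0, 1.2),
                             Obj("lamp", 0.4, 0.0, 1.5)]}
checks = [
    lambda s: left_of(s["cup"], s["lamp"]),                     # relational check
    lambda s: abs(distance(s["cup"], s["lamp"]) - 0.8) < 0.2,   # metric check
]
print(spatial_fidelity(scene, checks))
```

Real pipelines would populate the scene from depth estimation and object detection models rather than hand-written coordinates, and the check set would be parsed from the prompt.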

📝 Abstract
Humans can intuitively compose and arrange scenes in 3D space for photography. However, can advanced AI image generators plan scenes with similar 3D spatial awareness when creating images from text or image prompts? We present GenSpace, a novel benchmark and evaluation pipeline to comprehensively assess the spatial awareness of current image generation models. Standard evaluations using general Vision-Language Models (VLMs) frequently fail to capture these detailed spatial errors. To address this challenge, we propose a specialized evaluation pipeline and metric, which reconstructs 3D scene geometry using multiple visual foundation models and provides a more accurate and human-aligned measure of spatial faithfulness. Our findings show that while AI models create visually appealing images and can follow general instructions, they struggle with specific 3D details such as object placement, relationships, and measurements. We summarize three core limitations in the spatial perception of current state-of-the-art image generation models: 1) Object Perspective Understanding, 2) Egocentric-Allocentric Transformation, and 3) Metric Measurement Adherence, highlighting possible directions for improving spatial intelligence in image generation.
Problem

Research questions and friction points this paper is trying to address.

Assessing AI image generators' 3D spatial awareness
Evaluating spatial errors in image generation models
Identifying limitations in object placement and measurements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking spatial awareness in image generation
Specialized 3D geometry reconstruction pipeline
Identifying key spatial perception limitations