Exploring Spatial Intelligence from a Generative Perspective

📅 2026-04-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a gap in evaluating generative spatial intelligence (GSI): the ability of generative or unified multimodal models to respect and manipulate 3D spatial constraints during image generation. Existing benchmarks assess spatial intelligence largely from an understanding perspective and overlook this capability. The study introduces GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It has two components: GSI-Real, a real-world dataset built with a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. A unified, model-agnostic evaluation protocol measures spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and also improves downstream spatial understanding, which the authors present as the first clear evidence that generative training can strengthen spatial reasoning.

📝 Abstract

Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI), the ability to respect and manipulate 3D spatial constraints during image generation, and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning, establishing a new pathway for advancing spatial intelligence in multimodal models.

Problem

Research questions and friction points this paper is trying to address.

spatial intelligence

generative spatial intelligence

multimodal models

image generation

3D spatial constraints

Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative Spatial Intelligence

GSI-Bench

spatially grounded image editing

3D-prior-guided generation

multimodal models
