Struct2D: A Perception-Guided Framework for Spatial Reasoning in Large Multimodal Models

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Can large multimodal models (LMMs) perform robust 3D spatial reasoning solely from perception-driven, structured 2D representations—bird's-eye views, object marks, and object-centric metadata—without explicit 3D inputs or specialized architectures? Method: The paper proposes Struct2D, a perception-guided structured 2D prompting framework, and introduces Struct2D-Set, a large-scale spatial reasoning instruction-tuning dataset comprising 200K QA pairs spanning eight task categories, synthesized automatically from 3D indoor scenes. Results: Closed-source LMMs (e.g., GPT-o3) exhibit surprisingly strong zero-shot 3D spatial reasoning when given structured 2D inputs, effectively handling tasks such as relative direction estimation and route planning. After fine-tuning on Struct2D-Set, the open-source Qwen2.5VL achieves competitive performance on benchmarks including 3D question answering, dense captioning, and object grounding, validating the efficacy of structured 2D priors for complex 3D reasoning.

📝 Abstract
Unlocking spatial reasoning in Large Multimodal Models (LMMs) is crucial for enabling intelligent interaction with 3D environments. While prior efforts often rely on explicit 3D inputs or specialized model architectures, we ask: can LMMs reason about 3D space using only structured 2D representations derived from perception? We introduce Struct2D, a perception-guided prompting framework that combines bird's-eye-view (BEV) images with object marks and object-centric metadata, optionally incorporating egocentric keyframes when needed. Using Struct2D, we conduct an in-depth zero-shot analysis of closed-source LMMs (e.g., GPT-o3) and find that they exhibit surprisingly strong spatial reasoning abilities when provided with structured 2D inputs, effectively handling tasks such as relative direction estimation and route planning. Building on these insights, we construct Struct2D-Set, a large-scale instruction tuning dataset with 200K fine-grained QA pairs across eight spatial reasoning categories, generated automatically from 3D indoor scenes. We fine-tune an open-source LMM (Qwen2.5VL) on Struct2D-Set, achieving competitive performance on multiple benchmarks, including 3D question answering, dense captioning, and object grounding. Our approach demonstrates that structured 2D inputs can effectively bridge perception and language reasoning in LMMs, without requiring explicit 3D representations as input. We will release both our code and dataset to support future research.
Problem

Research questions and friction points this paper is trying to address.

Enabling 3D spatial reasoning in LMMs using structured 2D representations
Evaluating LMMs' spatial abilities with bird's-eye-view images and metadata
Bridging perception and language reasoning without explicit 3D inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses structured 2D representations for 3D reasoning
Combines BEV images with object metadata
Generates large-scale QA dataset automatically
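The core prompting idea—pairing a marked BEV image with serialized object metadata—can be sketched as below. This is a minimal illustration, not the paper's actual code: the `ObjectMark` fields and prompt wording are hypothetical, and in practice the BEV image itself would be attached to the LMM request alongside this text.

```python
from dataclasses import dataclass

@dataclass
class ObjectMark:
    mark_id: int          # numeric mark drawn on the BEV image
    category: str         # object class label from perception
    center_xy: tuple      # 2D position in the BEV frame (meters)

def build_struct2d_prompt(question: str, objects: list) -> str:
    """Assemble a Struct2D-style text prompt: the BEV image (sent to the
    LMM separately) is referenced via numbered marks, and object-centric
    metadata is serialized as plain text for the model to reason over."""
    lines = [
        "You are given a bird's-eye-view (BEV) image of an indoor scene.",
        "Each object is labeled with a numeric mark. Object metadata:",
    ]
    for obj in objects:
        x, y = obj.center_xy
        lines.append(f"  [{obj.mark_id}] {obj.category} at ({x:.1f}, {y:.1f}) m")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

objs = [ObjectMark(1, "sofa", (2.0, 3.5)), ObjectMark(2, "table", (4.1, 1.2))]
prompt = build_struct2d_prompt("Which object is closer to the door at (0, 0)?", objs)
print(prompt)
```

The same serialization could drive the automated QA generation: since object positions are known from the 3D scene, ground-truth answers (here, distances to the door) can be computed programmatically.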