MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

📅 2025-03-17
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit limited 3D spatial understanding, e.g., indoor spatial-relation reasoning, metric size/distance estimation, and 3D grounding. To address this, the paper introduces: (1) Cubify Anything VQA (CA-VQA), an indoor-scene-centric 3D visual question answering dataset with open-set annotations, together with a dedicated evaluation benchmark; (2) a supervised fine-tuning recipe on this high-quality 3D scene data that yields MM-Spatial, a strong generalist MLLM; and (3) the incorporation of metric depth maps and multi-view image inputs to further improve 3D understanding. Experiments show that MM-Spatial achieves state-of-the-art results on 3D spatial understanding benchmarks, including the authors' own, and that the data alone gives the model depth perception comparable to dedicated monocular depth estimation models. The SFT dataset and benchmark will be publicly released.

๐Ÿ“ Abstract
Multimodal large language models (MLLMs) excel at 2D visual understanding but remain limited in their ability to reason about 3D space. In this work, we leverage large-scale high-quality 3D scene data with open-set annotations to introduce 1) a novel supervised fine-tuning dataset and 2) a new evaluation benchmark, focused on indoor scenes. Our Cubify Anything VQA (CA-VQA) data covers diverse spatial tasks including spatial relationship prediction, metric size and distance estimation, and 3D grounding. We show that CA-VQA enables us to train MM-Spatial, a strong generalist MLLM that also achieves state-of-the-art performance on 3D spatial understanding benchmarks, including our own. We show how incorporating metric depth and multi-view inputs (provided in CA-VQA) can further improve 3D understanding, and demonstrate that data alone allows our model to achieve depth perception capabilities comparable to dedicated monocular depth estimation models. We will publish our SFT dataset and benchmark.
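To make the task coverage concrete, here is a rough sketch of what a single CA-VQA-style SFT sample could look like. This is an illustrative assumption only: the field names (`scene_id`, `metric_depth`, `grounding_3d`, etc.) and the overall schema are invented for this example and are not the released dataset format.

```python
# Hypothetical CA-VQA-style SFT sample (illustrative only, not the released schema).
ca_vqa_sample = {
    "scene_id": "indoor_scene_0001",
    "images": ["view_00.jpg", "view_01.jpg"],      # multi-view RGB frames
    "metric_depth": ["view_00_depth.png"],         # metric depth map(s), values in meters
    "task": "metric_distance_estimation",          # or "spatial_relation", "metric_size", "3d_grounding"
    "question": "How far is the sofa from the television, in meters?",
    "answer": "About 2.4 meters.",
    "grounding_3d": {                              # populated only for 3D grounding questions
        "object": "sofa",
        "box_center_m": [1.2, 0.4, 3.1],           # 3D box center, meters
        "box_size_m": [1.8, 0.9, 0.8],             # 3D box extent, meters
    },
}

# Such a sample could then be flattened into a chat-style SFT record:
sft_record = {
    "messages": [
        {"role": "user", "content": ca_vqa_sample["question"]},
        {"role": "assistant", "content": ca_vqa_sample["answer"]},
    ],
    "images": ca_vqa_sample["images"],
    "depth": ca_vqa_sample["metric_depth"],
}
```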
Problem

Research questions and friction points this paper is trying to address.

Enhance 3D spatial reasoning in multimodal LLMs
Introduce a novel dataset and benchmark for 3D understanding
Improve depth perception using metric depth and multi-view inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages large-scale 3D scene data
Introduces CA-VQA for spatial tasks
Incorporates metric depth and multi-view inputs
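As a rough sketch of the last point, the snippet below shows one way metric depth maps and multi-view images could be encoded into extra tokens for an MLLM. Everything here (the class name, the shared vision encoder, the 1x1 depth adapter, the linear projector) is an assumption made for illustration, not a description of MM-Spatial's actual architecture.

```python
import torch
import torch.nn as nn

class SpatialInputFusion(nn.Module):
    """Toy sketch: encode multi-view RGB frames and a metric depth map into
    tokens that can be prepended to an LLM's text embeddings (hypothetical)."""

    def __init__(self, vision_encoder: nn.Module, hidden_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder                  # assumed to map (B, 3, H, W) -> (B, N, hidden_dim)
        self.depth_adapter = nn.Conv2d(1, 3, kernel_size=1)   # lift 1-channel depth to 3 channels
        self.projector = nn.Linear(hidden_dim, llm_dim)

    def forward(self, views: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # views: (B, V, 3, H, W) multi-view RGB; depth: (B, 1, H, W) metric depth in meters.
        b, v, c, h, w = views.shape
        rgb_tokens = self.vision_encoder(views.reshape(b * v, c, h, w))   # (B*V, N, hidden)
        rgb_tokens = rgb_tokens.reshape(b, -1, rgb_tokens.shape[-1])      # concatenate views along the token axis
        depth_tokens = self.vision_encoder(self.depth_adapter(depth))     # (B, N, hidden)
        tokens = torch.cat([rgb_tokens, depth_tokens], dim=1)             # (B, V*N + N, hidden)
        return self.projector(tokens)                                     # ready to prepend to text embeddings
```

The design choice sketched here, reusing the RGB vision encoder for depth via a channel adapter, keeps the token interface uniform across modalities; the paper may encode depth differently, so treat this strictly as an illustration.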
🔎 Similar Papers
No similar papers found.