MM-Spatial: Exploring 3D Spatial Understanding in Multimodal LLMs

📅 2025-03-17
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Current multimodal large language models (MLLMs) exhibit limited 3D spatial understanding, e.g., indoor spatial-relation reasoning, metric size/distance estimation, and 3D grounding. To address this, the paper introduces: (1) Cubify Anything VQA (CA-VQA), an indoor-scene-centric 3D visual question answering dataset with open-set annotations, together with a dedicated evaluation benchmark; (2) a supervised fine-tuning recipe on this high-quality 3D scene data that yields MM-Spatial, a strong generalist MLLM; and (3) the incorporation of metric depth maps and multi-view image inputs to further improve 3D understanding. Experiments show that MM-Spatial achieves state-of-the-art results on 3D spatial understanding benchmarks, including the authors' own, and that the data alone gives the model depth perception comparable to dedicated monocular depth estimation models. The SFT dataset and benchmark will be publicly released.

๐Ÿ“ Abstract
Multimodal large language models (MLLMs) excel at 2D visual understanding but remain limited in their ability to reason about 3D space. In this work, we leverage large-scale high-quality 3D scene data with open-set annotations to introduce 1) a novel supervised fine-tuning dataset and 2) a new evaluation benchmark, focused on indoor scenes. Our Cubify Anything VQA (CA-VQA) data covers diverse spatial tasks including spatial relationship prediction, metric size and distance estimation, and 3D grounding. We show that CA-VQA enables us to train MM-Spatial, a strong generalist MLLM that also achieves state-of-the-art performance on 3D spatial understanding benchmarks, including our own. We show how incorporating metric depth and multi-view inputs (provided in CA-VQA) can further improve 3D understanding, and demonstrate that data alone allows our model to achieve depth perception capabilities comparable to dedicated monocular depth estimation models. We will publish our SFT dataset and benchmark.
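To make the task coverage concrete, here is a rough sketch of what a single CA-VQA-style SFT sample could look like. This is an illustrative assumption only: the field names (`scene_id`, `metric_depth`, `grounding_3d`, etc.) and the overall schema are invented for this example and are not the released dataset format.

```python
# Hypothetical CA-VQA-style SFT sample (illustrative only, not the released schema).
ca_vqa_sample = {
    "scene_id": "indoor_scene_0001",
    "images": ["view_00.jpg", "view_01.jpg"],      # multi-view RGB frames
    "metric_depth": ["view_00_depth.png"],         # metric depth map(s), values in meters
    "task": "metric_distance_estimation",          # or "spatial_relation", "metric_size", "3d_grounding"
    "question": "How far is the sofa from the television, in meters?",
    "answer": "About 2.4 meters.",
    "grounding_3d": {                              # populated only for 3D grounding questions
        "object": "sofa",
        "box_center_m": [1.2, 0.4, 3.1],           # 3D box center, meters
        "box_size_m": [1.8, 0.9, 0.8],             # 3D box extent, meters
    },
}

# Such a sample could then be flattened into a chat-style SFT record:
sft_record = {
    "messages": [
        {"role": "user", "content": ca_vqa_sample["question"]},
        {"role": "assistant", "content": ca_vqa_sample["answer"]},
    ],
    "images": ca_vqa_sample["images"],
    "depth": ca_vqa_sample["metric_depth"],
}
```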
Problem

Research questions and friction points this paper is trying to address.

Enhance 3D spatial reasoning in multimodal LLMs
Introduce a novel dataset and benchmark for 3D understanding
Improve depth perception using metric depth and multi-view inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages large-scale 3D scene data
Introduces CA-VQA for spatial tasks
Incorporates metric depth and multi-view inputs
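As a rough sketch of the last point, the snippet below shows one way metric depth maps and multi-view images could be encoded into extra tokens for an MLLM. Everything here (the class name, the shared vision encoder, the 1x1 depth adapter, the linear projector) is an assumption made for illustration, not a description of MM-Spatial's actual architecture.

```python
import torch
import torch.nn as nn

class SpatialInputFusion(nn.Module):
    """Toy sketch: encode multi-view RGB frames and a metric depth map into
    tokens that can be prepended to an LLM's text embeddings (hypothetical)."""

    def __init__(self, vision_encoder: nn.Module, hidden_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder                  # assumed to map (B, 3, H, W) -> (B, N, hidden_dim)
        self.depth_adapter = nn.Conv2d(1, 3, kernel_size=1)   # lift 1-channel depth to 3 channels
        self.projector = nn.Linear(hidden_dim, llm_dim)

    def forward(self, views: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # views: (B, V, 3, H, W) multi-view RGB; depth: (B, 1, H, W) metric depth in meters.
        b, v, c, h, w = views.shape
        rgb_tokens = self.vision_encoder(views.reshape(b * v, c, h, w))   # (B*V, N, hidden)
        rgb_tokens = rgb_tokens.reshape(b, -1, rgb_tokens.shape[-1])      # concatenate views along the token axis
        depth_tokens = self.vision_encoder(self.depth_adapter(depth))     # (B, N, hidden)
        tokens = torch.cat([rgb_tokens, depth_tokens], dim=1)             # (B, V*N + N, hidden)
        return self.projector(tokens)                                     # ready to prepend to text embeddings
```

The design choice sketched here, reusing the RGB vision encoder for depth via a channel adapter, keeps the token interface uniform across modalities; the paper may encode depth differently, so treat this strictly as an illustration.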
🔎 Similar Papers
No similar papers found.