MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

📅 2024-06-13
🏛️ Neural Information Processing Systems
📈 Citations: 16
Influential citations: 0
🤖 AI Summary
Existing 3D multimodal datasets are largely limited to object attributes or simple spatial relations, lacking hierarchical semantic understanding, from regions to objects and from single- to multi-object relationships. To address this, the paper introduces MMScan, the first large-scale multimodal 3D scene dataset with hierarchical grounded language annotations, comprising 1.4M natural language descriptions over 109K objects and 7.7K regions. A top-down hierarchical linguistic annotation paradigm integrates vision-language models (VLMs) for semi-automatic initial labeling with human-in-the-loop verification to ensure naturalness, accuracy, and fine-grained 3D semantic alignment, and multi-stage quality control guards annotation reliability. Evaluated on 3D visual grounding and LLM-driven 3D question answering, models trained on MMScan show marked improvements on existing benchmarks and in in-the-wild evaluation. MMScan thus provides a foundational resource, in both data and methodology, for advancing multimodal 3D perception research.

📝 Abstract
With the emergence of LLMs and their integration with other data modalities, multi-modal 3D perception has attracted increasing attention due to its connection to the physical world and has made rapid progress. However, limited by existing datasets, previous works mainly focus on understanding object properties or inter-object spatial relationships in a 3D scene. To tackle this problem, this paper builds the largest multi-modal 3D scene dataset and benchmark to date with hierarchical grounded language annotations, MMScan. It is constructed based on a top-down logic, from the region level to the object level and from single targets to inter-target relationships, covering holistic aspects of spatial and attribute understanding. The overall pipeline incorporates powerful VLMs via carefully designed prompts to initialize the annotations efficiently and further involves human correction in the loop to ensure the annotations are natural, correct, and comprehensive. Built upon existing 3D scanning data, the resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions, as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks. We evaluate representative baselines on our benchmarks, analyze their capabilities in different aspects, and showcase the key problems to be addressed in the future. Furthermore, we use this high-quality dataset to train state-of-the-art 3D visual grounding models and LLMs and obtain remarkable performance improvements both on existing benchmarks and in in-the-wild evaluation. Codes, datasets, and benchmarks will be available at https://github.com/OpenRobotLab/EmbodiedScan.
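To make the region-to-object hierarchy described above concrete, the following is a minimal sketch of how one hierarchical grounded-language sample might be organized. The field names (`region_caption`, `inter_target_relations`, `bbox_3d`, etc.), the box format, and the example values are illustrative assumptions for this sketch, not MMScan's released schema.

```python
# Illustrative sketch of a hierarchical grounded-language sample (hypothetical schema,
# not MMScan's released format): a region-level caption, per-object captions, and
# inter-target relations, mirroring the region -> object -> relation hierarchy.
from dataclasses import dataclass, field


@dataclass
class ObjectAnnotation:
    object_id: str
    category: str
    caption: str       # single-target attribute/spatial description
    bbox_3d: list      # placeholder box, e.g. [x, y, z, dx, dy, dz, yaw]


@dataclass
class RegionAnnotation:
    region_id: str
    region_caption: str                                   # region-level description
    objects: list = field(default_factory=list)           # ObjectAnnotation entries
    inter_target_relations: list = field(default_factory=list)  # (subject, relation, object) triples


def grounding_queries(region: RegionAnnotation):
    """Turn meta-annotations into simple grounding prompts (toy example)."""
    for obj in region.objects:
        yield f"In {region.region_caption}, find {obj.caption}", obj.object_id


if __name__ == "__main__":
    region = RegionAnnotation(
        region_id="scene0000_region_02",
        region_caption="a compact study area beside the window",
        objects=[
            ObjectAnnotation("desk_01", "desk", "the wooden desk against the wall",
                             [1.2, 0.5, 0.4, 1.4, 0.7, 0.75, 0.0]),
            ObjectAnnotation("chair_03", "chair", "the black office chair facing the desk",
                             [1.0, 1.1, 0.45, 0.6, 0.6, 0.9, 1.57]),
        ],
        inter_target_relations=[("chair_03", "in front of", "desk_01")],
    )
    for query, target in grounding_queries(region):
        print(target, "<-", query)
```

A structure like this makes it easy to derive both single-target grounding queries (from object captions) and multi-target or region-level queries (from the relation triples and region caption), which is the kind of coverage the benchmark samples are built from.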
Problem

Research questions and friction points this paper is trying to address.

Lack of large multi-modal 3D datasets with hierarchical annotations
Limited focus on holistic spatial and attribute understanding in 3D scenes
Need for efficient annotation methods combining VLMs and human correction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical grounded language annotations for 3D scenes
VLMs with carefully designed prompts for efficient annotation initialization
Human-in-the-loop correction for natural, correct, and comprehensive annotations (see the sketch after this list)
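The annotation paradigm named above (VLM initialization followed by human verification) could look roughly like the sketch below. `query_vlm`, `human_review`, `build_prompt`, and `annotate` are placeholder names standing in for the paper's prompt-based VLM calls and its multi-stage manual correction; this is an assumed workflow for illustration, not the authors' released pipeline.

```python
# Rough sketch of a VLM-initialized, human-in-the-loop annotation loop
# (assumed workflow, not the authors' actual code).
from typing import Callable


def query_vlm(prompt: str, rendered_views: list) -> str:
    """Placeholder for a prompted VLM call (e.g., captioning a target from rendered views)."""
    return f"[draft caption for prompt: {prompt!r}]"


def human_review(draft: str) -> tuple[bool, str]:
    """Placeholder for manual verification; returns (accepted, corrected_text)."""
    return True, draft  # in practice an annotator edits or rejects the draft


def annotate(targets: list[dict],
             build_prompt: Callable[[dict], str],
             max_rounds: int = 2) -> dict[str, str]:
    """Initialize captions with a VLM, then pass every draft through human correction."""
    final = {}
    for target in targets:
        draft = query_vlm(build_prompt(target), target.get("views", []))
        for _ in range(max_rounds):
            accepted, text = human_review(draft)
            if accepted:
                final[target["id"]] = text
                break
            # Rejected drafts are re-queried with the feedback appended to the prompt.
            draft = query_vlm(build_prompt(target) + f"\nPrevious attempt was rejected: {text}",
                              target.get("views", []))
        else:
            final[target["id"]] = text  # fall back to the last human-corrected text
    return final


if __name__ == "__main__":
    objects = [{"id": "desk_01", "category": "desk", "views": []}]
    captions = annotate(objects, lambda t: f"Describe the {t['category']} and its spatial context.")
    print(captions)
```

The design point is simply that the VLM does the cheap first pass while humans remain the arbiter of correctness, which is what allows the dataset to reach 1.4M captions without sacrificing annotation quality.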