🤖 AI Summary
Current 3D large language models (3D-LLMs) suffer from unreliable language grounding and embodied understanding due to the scarcity of large-scale, densely aligned language–3D-scene instruction data. To address this, we introduce the first million-scale 3D-LLM instruction dataset, comprising 40K household scenes and 6.2M diverse, semantically rich instructions, and propose 3D-POPE, a novel hallucination-evaluation benchmark for 3D-LLMs. Leveraging 3D scene synthesis, instruction-template engineering, and multimodal alignment modeling, we demonstrate effective sim-to-real transfer of synthetic data to real-world ScanNet scenes. Our approach substantially enhances visual grounding, reduces hallucination rates by 37.2%, and achieves state-of-the-art performance across multiple 3D understanding tasks. Furthermore, we empirically characterize the scaling law governing 3D-LLM performance with respect to instruction-data scale, the first such characterization in the field. This work advances the deep integration of natural language understanding and 3D perception.
📝 Abstract
The integration of language and 3D perception is crucial for embodied agents and robots that comprehend and interact with the physical world. While large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, their adaptation to 3D environments (3D-LLMs) remains in its early stages. A primary challenge is the lack of large-scale datasets with dense grounding between language and 3D scenes. We introduce 3D-GRAND, a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely grounded scene-language instructions. Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs. As part of our contributions, we propose 3D-POPE, a comprehensive benchmark that systematically evaluates hallucination in 3D-LLMs and enables fair comparisons across models. Our experiments highlight a scaling effect between dataset size and 3D-LLM performance, underscoring the importance of large-scale 3D-text datasets for embodied AI research. We also observe early signals of effective sim-to-real transfer, indicating that models trained on large synthetic datasets can perform well on real-world 3D scans. Through 3D-GRAND and 3D-POPE, we aim to equip the embodied AI community with resources and insights that lead to more reliable and better-grounded 3D-LLMs. Project website: https://3d-grand.github.io