Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model

📅 2024-05-27
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
Existing large language models (LLMs) are limited in 3D understanding: they produce only textual or numerical outputs and cannot generate dense, semantically aligned segmentation masks. To address this, the authors propose the first multimodal framework enabling joint point cloud–text reasoning and end-to-end 3D mask generation. The method introduces a hierarchical mask decoder that progressively refines predictions from coarse localization to fine-grained segmentation masks, and a dual-LLM-token-driven segmentation mechanism that establishes an explicit semantic mapping from text tokens to 3D spatial regions. Evaluated on ScanNet and Matterport3D, the approach significantly outperforms state-of-the-art methods across multiple 3D segmentation tasks, including referring expression segmentation, 3D reasoning segmentation, and open-vocabulary mask-based visual question answering, achieving mIoU improvements exceeding 8.2%. To the authors' knowledge, this is the first work to enable LLM-driven 3D scene understanding with point-level segmentation mask outputs.
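The dual-token mechanism can be pictured as follows. This is a minimal numpy sketch, not the paper's actual code: the token names (<LOC>, <SEG>), the function names, the linear projections, and the sigmoid gating are all illustrative assumptions. The idea it shows is that the LLM emits two special tokens whose hidden states are projected into the point-feature space and matched against per-point features, yielding coarse region scores and gated fine mask scores.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(hidden, weight):
    # Linear projection of an LLM token hidden state into point-feature space.
    return weight @ hidden

def dual_token_masks(loc_hidden, seg_hidden, point_feats, w_loc, w_seg):
    # loc_hidden, seg_hidden: hidden states of the <LOC> and <SEG> tokens
    # point_feats: (N, point_dim) per-point features from a 3D backbone
    loc_query = project(loc_hidden, w_loc)   # coarse-location query
    seg_query = project(seg_hidden, w_seg)   # fine-mask query
    coarse_logits = point_feats @ loc_query  # (N,) coarse region scores
    fine_logits = point_feats @ seg_query    # (N,) per-point mask scores
    # gate the fine scores by the coarse region (coarse-to-fine)
    gate = 1.0 / (1.0 + np.exp(-coarse_logits))
    return coarse_logits, fine_logits * gate
```

The dot-product matching between a projected text token and per-point features is what gives the "explicit semantic mapping from text tokens to 3D spatial regions" described above.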

📝 Abstract
Recent advancements in multimodal large language models (LLMs) have demonstrated significant potential across various domains, particularly in concept reasoning. However, their applications in understanding 3D environments remain limited, primarily offering textual or numerical outputs without generating dense, informative segmentation masks. This paper introduces Reason3D, a novel LLM designed for comprehensive 3D understanding. Reason3D processes point cloud data and text prompts to produce textual responses and segmentation masks, enabling advanced tasks such as 3D reasoning segmentation, hierarchical searching, express referring, and question answering with detailed mask outputs. We propose a hierarchical mask decoder that employs a coarse-to-fine approach to segment objects within expansive scenes. It begins with a coarse location estimation, followed by object mask estimation, using two unique tokens predicted by LLMs based on the textual query. Experimental results on large-scale ScanNet and Matterport3D datasets validate the effectiveness of our Reason3D across various tasks.
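The coarse-to-fine decoding described in the abstract can be sketched in a few lines. This is an illustrative numpy sketch under assumptions, not the paper's hierarchical mask decoder: the top-k selection, the `keep_ratio` parameter, and the zero-threshold binarization are hypothetical stand-ins for the two-stage estimation the abstract describes.

```python
import numpy as np

def coarse_to_fine(coarse_scores, point_feats, seg_query, keep_ratio=0.25):
    # Stage 1 (coarse location estimation): keep only the highest-scoring
    # fraction of points, narrowing the search within an expansive scene.
    n = coarse_scores.shape[0]
    n_keep = max(1, int(n * keep_ratio))
    keep_idx = np.argsort(coarse_scores)[-n_keep:]
    # Stage 2 (object mask estimation): score only the kept points
    # with the fine query and binarize into a segmentation mask.
    fine_scores = point_feats[keep_idx] @ seg_query
    mask = np.zeros(n, dtype=bool)
    mask[keep_idx] = fine_scores > 0
    return mask
```

The benefit of the two-stage design is that the expensive fine-mask scoring runs only on the candidate region, which is what makes segmentation tractable in large scenes like those in ScanNet and Matterport3D.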
Problem

Research questions and friction points this paper is trying to address.

Extending LLMs beyond textual or numerical outputs to genuine 3D environment understanding
Generating dense, informative segmentation masks directly from point cloud data
Segmenting individual objects reliably within large, expansive 3D scenes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reason3D, an LLM that jointly produces textual responses and 3D segmentation masks
A hierarchical mask decoder driven by two LLM-predicted tokens
Coarse-to-fine segmentation: coarse location estimation followed by fine object mask estimation