AI Summary
Existing 3D multimodal methods struggle to reason about human intent and fine-grained spatial relationships among objects, and are often restricted to single-category scenes, yielding coarse-grained, uninterpretable textual explanations. To address this, we propose the **3D Reasoning Segmentation** task, the first to jointly model multi-object 3D segmentation and fine-grained spatial relation explanation. We introduce **ReasonSeg3D**, the first open-source benchmark featuring 3D segmentation masks, dense spatial relation annotations, and spatial-relation-aware question-answer pairs. Furthermore, we design **MORE3D**, a novel network that fuses multimodal features, incorporates a query-based segmentation mechanism, and employs a relation-driven cross-modal reasoning decoder. Experiments demonstrate that MORE3D significantly improves both segmentation accuracy and the quality of spatial relation descriptions in complex, multi-object 3D scenes. This work establishes a new direction for interpretable, relation-aware 3D scene understanding in embodied intelligence.
Abstract
Recent developments in multimodal learning have greatly advanced research on 3D scene understanding in various real-world tasks such as embodied AI. However, most existing work shares two typical constraints: 1) it lacks the reasoning ability needed to interact with and interpret human intention, and 2) it focuses on scenarios with single-category objects only, which leads to over-simplified textual descriptions because multi-object scenarios and the spatial relations among objects are neglected. We bridge these research gaps by proposing a 3D reasoning segmentation task for multiple objects in scenes. The task produces 3D segmentation masks together with detailed textual explanations enriched by 3D spatial relations among objects. To this end, we create ReasonSeg3D, a large-scale and high-quality benchmark that integrates 3D segmentation masks and 3D spatial relations with generated question-answer pairs. In addition, we design MORE3D, a novel 3D reasoning network that works with queries of multiple objects and tailored 3D scene understanding designs. MORE3D learns detailed explanations of 3D relations and employs them to capture spatial information of objects and reason over textual outputs. Extensive experiments show that MORE3D excels at reasoning about and segmenting complex multi-object 3D scenes, and that ReasonSeg3D offers a valuable platform for future exploration of 3D reasoning segmentation. The dataset and code will be released.
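For intuition, the query-based segmentation idea mentioned above can be sketched minimally: a set of learned object queries is dotted against per-point scene features to produce one mask per candidate object. This is a generic illustration under assumed dimensions and random stand-in features, not the actual MORE3D architecture; all names and shapes here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (assumptions, not from the paper).
num_points = 1024    # points in the 3D scene
feat_dim = 64        # shared embedding width
num_queries = 4      # one learnable query per candidate object

# Per-point features, e.g. from a 3D backbone (random stand-ins here).
point_feats = rng.standard_normal((num_points, feat_dim))

# Object queries; in a real model these would be trained parameters.
object_queries = rng.standard_normal((num_queries, feat_dim))

# Query-based segmentation: each query dots against every point feature,
# giving per-query mask logits over the whole point cloud.
mask_logits = object_queries @ point_feats.T   # shape (num_queries, num_points)
masks = mask_logits > 0.0                      # one binary 3D mask per query

print(masks.shape)  # (4, 1024)
```

The same query embeddings could also condition a language decoder, which is how a multi-object setup can tie each segmented object to the spatial relations mentioned in its textual explanation.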