3D-DRES: Detailed 3D Referring Expression Segmentation

📅 2026-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitation of existing 3D visual grounding methods, which are typically confined to sentence-level detection or segmentation and thus fail to leverage the compositional semantics and contextual reasoning inherent in natural language. To enable finer-grained 3D vision-language understanding, we introduce the task of fine-grained 3D Densely Referring Expression Segmentation (3D-DRES), which establishes explicit mappings from linguistic phrases to 3D object instances. We pioneer a phrase-to-instance annotation paradigm and construct DetailRefer, a large-scale dataset comprising 54,432 referring expressions. Furthermore, we propose DetailBase, a unified architecture capable of performing both sentence-level and phrase-level segmentation. Experiments demonstrate that models trained on DetailRefer achieve state-of-the-art performance on phrase-level segmentation and significantly outperform prior methods on standard 3D-RES benchmarks.

📝 Abstract
Current 3D visual grounding tasks only perform sentence-level detection or segmentation, failing to leverage the rich compositional and contextual reasoning within natural language expressions. To address this challenge, we introduce Detailed 3D Referring Expression Segmentation (3D-DRES), a new task that establishes phrase-to-3D-instance mappings, aiming to enhance fine-grained 3D vision-language understanding. To support 3D-DRES, we present DetailRefer, a new dataset comprising 54,432 descriptions spanning 11,054 distinct objects. Unlike previous datasets, DetailRefer adopts a pioneering phrase-instance annotation paradigm in which each referenced noun phrase is explicitly mapped to its corresponding 3D instances. Additionally, we introduce DetailBase, a deliberately streamlined yet effective baseline architecture that supports dual-mode segmentation at both the sentence and phrase levels. Our experiments demonstrate that models trained on DetailRefer not only excel at phrase-level segmentation but also show notable improvements on traditional 3D-RES benchmarks.
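To make the phrase-instance annotation paradigm concrete, here is a minimal sketch of what one such record and its phrase-to-instance lookup might look like. This is purely illustrative: the field names, scene id, and spans are assumptions, not the actual DetailRefer schema.

```python
# Hypothetical phrase-to-instance annotation record in the style the
# abstract describes (all field names and values are illustrative).
record = {
    "scene_id": "scene0000_00",
    "expression": "the black chair next to the wooden table",
    "phrases": [
        # each referenced noun phrase maps to the 3D instance id(s) it denotes
        {"span": (0, 15), "text": "the black chair", "instance_ids": [7]},
        {"span": (24, 40), "text": "the wooden table", "instance_ids": [3]},
    ],
}

def phrase_to_instances(record):
    """Return a mapping from each annotated noun phrase to its 3D instance ids."""
    return {p["text"]: p["instance_ids"] for p in record["phrases"]}

mapping = phrase_to_instances(record)
# e.g. mapping["the black chair"] -> [7]
```

The key point of such a format is that supervision is attached per phrase rather than per sentence, which is what enables phrase-level segmentation alongside the usual sentence-level target.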
Problem

Research questions and friction points this paper is trying to address.

3D referring expression segmentation
phrase-level grounding
3D vision-language understanding
fine-grained segmentation
compositional reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D referring expression segmentation
phrase-instance alignment
fine-grained vision-language understanding
3D visual grounding
DetailRefer dataset
Qi Chen
Xiamen University
Changli Wu
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China; Shanghai Innovation Institute
Jiayi Ji
Rutgers University
Yiwei Ma
Stevens Institute of Technology
Liujuan Cao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China