Disc3D: Automatic Curation of High-Quality 3D Dialog Data via Discriminative Object Referring

📅 2025-11-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
3D multimodal large language models (MLLMs) are held back by scarce high-quality conversational data and by two ambiguities in spatial language: viewpoint ambiguity and object-referring ambiguity. To address both, the authors propose a fully automated pipeline for generating 3D scene dialogues. It combines 2D multimodal LLMs, large language models, and rule-based constraints to perform meta-annotation collection, scene graph construction with relation correction, discriminative object referring, and multi-task dialogue generation. The discriminative object referring stage generates exclusive, compact descriptions, so that a reference picks out its target rather than a distractor. Using this pipeline, they construct Disc3D, a large-scale 3D dialogue dataset built without human annotation, comprising 25K scenes and over 2 million samples. Training with Disc3D yields consistent improvements on public benchmarks and on the newly established Disc3D-QA tasks.

📝 Abstract
3D Multi-modal Large Language Models (MLLMs) still lag behind their 2D peers, largely because large-scale, high-quality 3D scene-dialogue datasets remain scarce. Prior efforts hinge on expensive human annotation and leave two key ambiguities unresolved: viewpoint ambiguity, where spatial language presumes unknown camera poses, and object referring ambiguity, where non-exclusive descriptions blur the line between targets and distractors. We therefore present a fully automated pipeline that converts raw 3D scans into unambiguous, high-quality dialogue data at a fraction of the previous cost. By synergizing rule-based constraints with 2D MLLMs and LLMs, the pipeline enables controllable, scalable generation without human intervention. The pipeline comprises four stages: (1) meta-annotation collection harvesting object-, frame-, and scene-level captions, (2) scene graph construction with relation correction to capture proximal object relations, (3) discriminative object referring that generates exclusive and compact descriptions, and (4) multi-task data generation synthesizing diverse dialogues. Our pipeline systematically mitigates inherent flaws in source datasets and produces the final Disc3D dataset, over 2 million samples in 25K hybrid 3D scenes, spanning scene, view, and object captioning, visual grounding, and five object-centric QA tasks. Extensive experiments demonstrate that training with Disc3D yields consistent, significant improvements on both public benchmarks and our multifaceted Disc3D-QA tasks. Code, data, and models will be publicly available.
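Stage (2) of the pipeline builds a scene graph over proximal object relations. The abstract does not state how proximity is decided, so the following is only a minimal sketch under stated assumptions: each object is reduced to a 3D center point, and two objects are linked when their centers lie within a distance threshold. The function name `build_proximal_graph`, the `max_distance` parameter, and the sample scene are all hypothetical and are not taken from the paper.

```python
import math


def build_proximal_graph(objects, max_distance=1.5):
    """Link every pair of objects whose centers are within max_distance.

    `objects` maps an object name to its (x, y, z) center in meters.
    Both the center-point representation and the threshold are
    illustrative assumptions, not the paper's actual criteria.
    """
    edges = []
    names = sorted(objects)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            dist = math.dist(objects[a], objects[b])
            if dist <= max_distance:
                edges.append((a, b, round(dist, 2)))
    return edges


# Hypothetical toy scene: the lamp is too far from everything to be linked.
scene = {
    "chair": (0.0, 0.0, 0.0),
    "table": (1.0, 0.0, 0.0),
    "lamp": (4.0, 0.0, 0.0),
}
edges = build_proximal_graph(scene)
print(edges)
```

Restricting edges to nearby pairs keeps the graph sparse, which is one plausible reason to focus on proximal relations; the relation-correction step mentioned in the abstract is not modeled here.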
Problem

Research questions and friction points this paper is trying to address.

Addressing scarcity of high-quality 3D scene-dialogue datasets for MLLMs
Resolving viewpoint and object referring ambiguities in 3D spatial language
Automating 3D dialogue data generation to replace expensive human annotation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline converts 3D scans into dialogue data
Synergizes rule-based constraints with 2D MLLMs and LLMs
Generates discriminative object referring with compact descriptions
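One way to read "exclusive and compact" is as a search for the smallest set of attributes that the target has and that no distractor shares in full. The sketch below illustrates that idea only. It assumes objects are flat attribute dictionaries, and the function name `discriminative_description` and the sample attributes are hypothetical; the paper's actual mechanism relies on 2D MLLMs, LLMs, and rules that are not reproduced here.

```python
from itertools import combinations


def discriminative_description(target, distractors):
    """Return the smallest attribute subset that singles out `target`.

    A subset is exclusive when no distractor matches the target on every
    attribute in it. Subsets are tried from smallest to largest, so the
    first hit is also the most compact. Returns None when the target
    cannot be distinguished from some distractor.
    """
    keys = sorted(target)
    for size in range(1, len(keys) + 1):
        for subset in combinations(keys, size):
            shared = any(
                all(d.get(k) == target[k] for k in subset)
                for d in distractors
            )
            if not shared:
                return {k: target[k] for k in subset}
    return None


# Hypothetical scene with three chairs; the first one is the target.
chairs = [
    {"category": "chair", "color": "red", "material": "wood"},
    {"category": "chair", "color": "red", "material": "metal"},
    {"category": "chair", "color": "blue", "material": "wood"},
]
result = discriminative_description(chairs[0], chairs[1:])
print(result)  # {'color': 'red', 'material': 'wood'}
```

Here no single attribute is exclusive, because "red" and "wood" each also describe a distractor, so the shortest unambiguous reference needs both. The brute-force search is exponential in the number of attributes, which is acceptable for a handful of attributes per object but would need pruning at scale.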
Siyuan Wei
PICO, ByteDance, Beijing
Chunjie Wang
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
6G, UAV, RIS, ISAC, Wireless communication
Xiao Liu
PICO, ByteDance, Beijing
Xiaosheng Yan
PICO, ByteDance, Beijing
Zhishan Zhou
PICO, ByteDance, Beijing
Rui Huang
Tsinghua University