AI Summary
3D multimodal large language models (MLLMs) suffer from scarce high-quality conversational data and from ambiguities in viewpoint- and object-reference expressions. To address these challenges, we propose the first fully automated pipeline for generating 3D scene dialogues: it integrates 2D multimodal LLMs, large language models, and rule-based constraints to perform meta-annotation collection, scene graph construction, relational refinement, and multi-task dialogue generation. We further introduce a discriminative object reference mechanism that explicitly addresses both viewpoint ambiguity and non-exclusive referring expressions. Based on this pipeline, we construct Disc3D, a large-scale, human-annotation-free 3D dialogue benchmark comprising 25K scenes and over 2 million samples. Evaluated on both public benchmarks and our newly established Disc3D-QA benchmark, our approach significantly advances 3D MLLM performance, effectively overcoming the dual bottlenecks of data scarcity and referential ambiguity.
Abstract
3D Multi-modal Large Language Models (MLLMs) still lag behind their 2D peers, largely because large-scale, high-quality 3D scene-dialogue datasets remain scarce. Prior efforts hinge on expensive human annotation and leave two key ambiguities unresolved: viewpoint ambiguity, where spatial language presumes unknown camera poses, and object referring ambiguity, where non-exclusive descriptions blur the line between targets and distractors. We therefore present a fully automated pipeline that converts raw 3D scans into unambiguous, high-quality dialogue data at a fraction of the previous cost. By synergizing rule-based constraints with 2D MLLMs and LLMs, the pipeline enables controllable, scalable generation without human intervention. The pipeline comprises four stages: (1) meta-annotation collection harvesting object-, frame-, and scene-level captions, (2) scene graph construction with relation correction to capture proximal object relations, (3) discriminative object referring that generates exclusive and compact descriptions, and (4) multi-task data generation synthesizing diverse dialogues. Our pipeline systematically mitigates inherent flaws in source datasets and produces the final Disc3D dataset, over 2 million samples in 25K hybrid 3D scenes, spanning scene, view, and object captioning, visual grounding, and five object-centric QA tasks. Extensive experiments demonstrate that training with Disc3D yields consistent, significant improvements on both public benchmarks and our multifaceted Disc3D-QA tasks. Code, data, and models will be publicly available.
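To make the notion of an "exclusive and compact" description in stage (3) concrete, the following is a minimal sketch and not the pipeline's actual implementation. It assumes each object is a flat dictionary of attributes; the function name `discriminative_reference` and the attribute keys are hypothetical. The sketch searches for the smallest attribute subset that matches the target while ruling out every distractor.

```python
from itertools import combinations


def discriminative_reference(target, distractors, attribute_order):
    """Return the smallest attribute subset that matches the target only.

    Hypothetical helper for illustration: objects are plain dicts, and a
    description is a subset of the target's attribute key/value pairs.
    """
    # Try shorter descriptions first so the result is as compact as possible.
    for size in range(1, len(attribute_order) + 1):
        for keys in combinations(attribute_order, size):
            # A distractor is confusable if it shares every chosen value.
            confusable = [
                d for d in distractors
                if all(d.get(k) == target.get(k) for k in keys)
            ]
            if not confusable:
                return {k: target[k] for k in keys}
    # No subset of the given attributes excludes all distractors.
    return None


# Illustrative scene: one target chair and two distractors.
target = {"category": "chair", "color": "red", "material": "wood"}
distractors = [
    {"category": "chair", "color": "blue", "material": "wood"},
    {"category": "table", "color": "red", "material": "wood"},
]
description = discriminative_reference(
    target, distractors, ["category", "color", "material"]
)
```

In this toy scene no single attribute is exclusive, so the sketch falls back to a pair. When a distractor is identical to the target on every listed attribute, the function returns `None`, which signals that additional cues (for example, spatial relations from a scene graph) would be needed.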