🤖 AI Summary
Current 3D large language models (3D LLMs) suffer from weak discriminative capability and poor generalization due to the scarcity of high-quality instruction-following data. To address this, we propose Robin3D, a novel framework for instruction-tuned 3D multimodal understanding. Its core contributions are: (1) a Robust Instruction Generation (RIG) engine that automatically synthesizes large-scale, diverse, and adversarial 3D instruction data; (2) a Relation-Augmented Projector coupled with ID-Feature Bonding, which significantly improves spatial relationship modeling and object reference resolution; and (3) comprehensive 3D multimodal instruction tuning leveraging these components. Extensive experiments demonstrate that Robin3D consistently outperforms state-of-the-art methods across five major 3D benchmarks: it achieves a 7.8% absolute improvement in localization accuracy on Multi3DRefer and a 6.9% gain in caption quality on Scan2Cap, without task-specific fine-tuning.
📄 Abstract
Recent advancements in 3D Large Language Models (3DLLMs) have highlighted their potential for building general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality, robust instruction-following data, which limits the discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data generated by our novel data engine, the Robust Instruction Generation (RIG) engine. RIG generates two kinds of key instruction data: 1) Adversarial Instruction-following data, which mixes negative and positive samples to enhance the model's discriminative understanding, and 2) Diverse Instruction-following data, which contains varied instruction styles to enhance the model's generalization. In total, we construct 1 million instruction-following samples, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training-set samples. To better handle these complex instructions, Robin3D first incorporates a Relation-Augmented Projector to enhance spatial understanding, and then strengthens its object referring and grounding ability through ID-Feature Bonding. Robin3D consistently outperforms previous methods across five widely used 3D multimodal learning benchmarks without task-specific fine-tuning. Notably, we achieve a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9% improvement in the captioning task (Scan2Cap).