🤖 AI Summary
Current 3D large language models (3D LLMs) suffer from weak discriminative capability and poor generalization due to the scarcity of high-quality instruction-following data. To address this, we propose Robin3D, a novel framework for instruction-tuned 3D multimodal understanding. Its core contributions are: (1) a Robust Instruction Generation (RIG) engine that automatically synthesizes large-scale, diverse, and adversarial 3D instruction data; (2) a Relation-Augmented Projector coupled with ID-Feature Bonding, which significantly improves spatial relationship modeling and object reference resolution; and (3) comprehensive 3D multimodal instruction tuning leveraging these components. Extensive experiments demonstrate that Robin3D consistently outperforms state-of-the-art methods across five major 3D benchmarks: it achieves a 7.8% absolute improvement in localization accuracy on Multi3DRefer and a 6.9% gain in caption quality on Scan2Cap, without task-specific fine-tuning.
📄 Abstract
Recent advancements in 3D Large Language Models (3DLLMs) have highlighted their potential for building general-purpose agents in the 3D real world, yet challenges remain due to the lack of high-quality, robust instruction-following data, which limits the discriminative power and generalization of 3DLLMs. In this paper, we introduce Robin3D, a powerful 3DLLM trained on large-scale instruction-following data generated by our novel data engine, the Robust Instruction Generation (RIG) engine. RIG generates two kinds of key instruction data: 1) Adversarial Instruction-following data, which mixes negative and positive samples to enhance the model's discriminative understanding, and 2) Diverse Instruction-following data, which contains varied instruction styles to enhance the model's generalization. In total, we construct 1 million instruction-following samples, consisting of 344K Adversarial samples, 508K Diverse samples, and 165K benchmark training-set samples. To better handle these complex instructions, Robin3D first incorporates a Relation-Augmented Projector to enhance spatial understanding, and then strengthens its object referring and grounding ability through ID-Feature Bonding. Robin3D consistently outperforms previous methods across five widely used 3D multimodal learning benchmarks without task-specific fine-tuning. Notably, we achieve a 7.8% improvement in the grounding task (Multi3DRefer) and a 6.9% improvement in the captioning task (Scan2Cap).