Reasoning Matters for 3D Visual Grounding

📅 2026-01-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two limitations of existing 3D visual grounding methods: they rely heavily on large-scale annotated data, and the synthetic data used to scale them is costly to collect and lacks an explicit reasoning process. To overcome these challenges, we propose the first automatic 3D visual grounding data synthesis pipeline that incorporates an explicit reasoning process. Leveraging this pipeline, we fine-tune a large language model to achieve effective cross-modal feature alignment and efficient training. Our approach significantly reduces dependence on extensive labeled or synthetic datasets. The resulting model, Reason3DVG-8B, is trained on only 1.6% of the data used by the current state-of-the-art LLM-based approach, 3D-GRAND, yet outperforms it, demonstrating the effectiveness and efficiency of reasoning-enhanced synthetic data for 3D visual grounding.

📝 Abstract

The recent development of Large Language Models (LLMs) with strong reasoning ability has driven research in various domains such as mathematics, coding, and scientific discovery. Meanwhile, 3D visual grounding, a fundamental task in 3D understanding, remains challenging because of the limited reasoning ability of recent 3D visual grounding models. Most current methods combine a text encoder and a visual feature encoder to generate fused cross-modal features and predict the referred object. These models often require supervised training on extensive 3D annotation data. Recent research has also focused on scaling synthetic data to train stronger 3D visual grounding LLMs; however, the performance gain remains limited and is not proportional to the data collection cost. In this work, we propose a 3D visual grounding data pipeline that automatically synthesizes 3D visual grounding data along with the corresponding reasoning process. Additionally, we use the generated data for LLM fine-tuning and introduce Reason3DVG-8B, a strong 3D visual grounding LLM that outperforms the previous LLM-based method 3D-GRAND using only 1.6% of its training data, demonstrating the effectiveness of our data and the importance of reasoning in 3D visual grounding.

Problem

Research questions and friction points this paper is trying to address.

3D visual grounding

reasoning

large language models

synthetic data

data efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D visual grounding

reasoning

synthetic data generation