🤖 AI Summary
Existing 3D large language models struggle with embodied reasoning about object–object and object–space relationships in scene-level question answering. To address this, we propose SCENECOT, the first framework to integrate chain-of-thought (CoT) reasoning into 3D scene understanding. Our method employs multimodal expert modules to extract visual cues, constructs 3D scene graphs to explicitly encode spatial structure, and introduces a stepwise subtask decomposition mechanism for fine-grained, interpretable reasoning. To support training and evaluation, we curate SCENECOT-185K—the first large-scale embodied CoT dataset for 3D scenes. Extensive experiments on multiple challenging 3D reasoning benchmarks demonstrate substantial improvements in answer accuracy and reasoning interpretability. Ablation studies confirm the efficacy of each component, while cross-scene evaluations validate strong generalization capability. SCENECOT thus establishes a new paradigm for embodied, reasoning-driven 3D scene understanding.
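The pipeline described above, in which a question is decomposed into grounded subtasks resolved against a 3D scene graph, can be illustrated with a minimal toy sketch. All names here (`SceneGraph`, the hard-coded subtask steps) are illustrative assumptions for a single relational question, not the paper's actual API or method:

```python
# Hypothetical sketch of grounded chain-of-thought over a 3D scene graph.
# The class and step structure are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    # objects: id -> (category, 3D centroid)
    objects: dict = field(default_factory=dict)
    # relations: (subject_id, predicate, object_id) triples
    relations: list = field(default_factory=list)

    def find(self, category):
        return [oid for oid, (cat, _) in self.objects.items() if cat == category]

    def related(self, oid, predicate):
        return [o2 for s, p, o2 in self.relations if s == oid and p == predicate]

def answer_with_cot(graph, question):
    """Decompose one relational question into grounded subtasks (toy example)."""
    steps = []
    # Step 1: ground the anchor object mentioned in the question.
    anchors = graph.find("table")
    steps.append(f"ground anchor 'table' -> {anchors}")
    # Step 2: follow the queried spatial relation from the anchor.
    targets = [t for a in anchors for t in graph.related(a, "next_to")]
    steps.append(f"follow relation 'next_to' -> {targets}")
    # Step 3: read out the category of the grounded target as the answer.
    answer = graph.objects[targets[0]][0] if targets else "unknown"
    steps.append(f"answer: {answer}")
    return answer, steps

g = SceneGraph(
    objects={"t1": ("table", (1.0, 0.0, 0.0)), "c1": ("chair", (1.5, 0.0, 0.0))},
    relations=[("t1", "next_to", "c1")],
)
ans, trace = answer_with_cot(g, "What is next to the table?")
```

Each intermediate step is explicitly grounded in scene-graph entities, which is what yields the interpretable reasoning trace (and the grounding-QA coherence) that the framework targets; in the real system, an LLM produces the decomposition and expert modules supply the visual cues.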
📝 Abstract
Existing research on 3D Large Language Models (LLMs) still struggles to achieve grounded question-answering, primarily due to the under-exploration of the mechanism of human-like scene-object grounded reasoning. This paper bridges the gap by presenting a novel framework. We first introduce a grounded Chain-of-Thought reasoning method in 3D scenes (SCENECOT), decoupling a complex reasoning task into simpler and manageable problems, and building corresponding visual clues based on multimodal expert modules. To enable such a method, we develop SCENECOT-185K, the first large-scale grounded CoT reasoning dataset, consisting of 185K high-quality instances. Extensive experiments across various complex 3D scene reasoning benchmarks demonstrate that our new framework achieves strong performance with high grounding-QA coherence. To the best of our knowledge, this is the first successful application of CoT reasoning to 3D scene understanding, enabling step-by-step human-like reasoning and showing potential for extension to broader 3D scene understanding scenarios.