3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering

📅 2025-07-16

📈 Citations: 0

✨ Influential: 0

career value

180K/year

🤖 AI Summary

To address the scarcity of high-quality 3D-language data for embodied indoor question answering (QA) and dense captioning, this paper proposes the first unified modality-context reasoning framework tailored for 3D scenes, leveraging foundation model collaboration to jointly generate QA pairs and object descriptions. Our method integrates multimodal embeddings, cross-modal interaction mechanisms, and a language decoder, trained on ScanNet augmented with ScanQA and ScanRefer annotations. Semantic filtering and data augmentation are incorporated to enhance generation fidelity. To our knowledge, this is the first work enabling controllable, joint generation of high-quality QA and object descriptions in 3D scenes, yielding a new benchmark dataset comprising 62K QA pairs and 73K object descriptions. Evaluation shows consistent improvements: +2.15 CIDEr on ScanQA and +1.84 CIDEr@0.5 on ScanRefer—substantially outperforming prior approaches.

Technology Category

Application Category

📝 Abstract

With the growing need for diverse and scalable data in indoor scene tasks, such as question answering and dense captioning, we propose 3D-MoRe, a novel paradigm designed to generate large-scale 3D-language datasets by leveraging the strengths of foundational models. The framework integrates key components, including multi-modal embedding, cross-modal interaction, and a language model decoder, to process natural language instructions and 3D scene data. This approach facilitates enhanced reasoning and response generation in complex 3D environments. Using the ScanNet 3D scene dataset, along with text annotations from ScanQA and ScanRefer, 3D-MoRe generates 62,000 question-answer (QA) pairs and 73,000 object descriptions across 1,513 scenes. We also employ various data augmentation techniques and implement semantic filtering to ensure high-quality data. Experiments on ScanQA demonstrate that 3D-MoRe significantly outperforms state-of-the-art baselines, with the CIDEr score improving by 2.15%. Similarly, on ScanRefer, our approach achieves a notable increase in CIDEr@0.5 by 1.84%, highlighting its effectiveness in both tasks. Our code and generated datasets will be publicly released to benefit the community, and both can be accessed on the https://3D-MoRe.github.io.

Problem

Research questions and friction points this paper is trying to address.

Generates large-scale 3D-language datasets for indoor tasks

Enhances reasoning in 3D environments via multimodal integration

Improves QA and dense captioning performance on benchmarks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages foundational models for 3D-language datasets

Integrates multi-modal embedding and cross-modal interaction

Employs data augmentation and semantic filtering

🔎 Similar Papers

Chain-of-Action: Faithful and Multimodal Question Answering through Large Language Models

2024-03-26arXiv.orgCitations: 9