3D-MoRe: Unified Modal-Contextual Reasoning for Embodied Question Answering

📅 2025-07-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the scarcity of high-quality 3D-language data for embodied indoor question answering (QA) and dense captioning, this paper proposes the first unified modality-context reasoning framework tailored for 3D scenes, leveraging foundation model collaboration to jointly generate QA pairs and object descriptions. Our method integrates multimodal embeddings, cross-modal interaction mechanisms, and a language decoder, trained on ScanNet augmented with ScanQA and ScanRefer annotations. Semantic filtering and data augmentation are incorporated to enhance generation fidelity. To our knowledge, this is the first work enabling controllable, joint generation of high-quality QA and object descriptions in 3D scenes, yielding a new benchmark dataset comprising 62K QA pairs and 73K object descriptions. Evaluation shows consistent improvements: +2.15 CIDEr on ScanQA and +1.84 CIDEr@0.5 on ScanRefer—substantially outperforming prior approaches.

Technology Category

Application Category

📝 Abstract
With the growing need for diverse and scalable data in indoor scene tasks, such as question answering and dense captioning, we propose 3D-MoRe, a novel paradigm designed to generate large-scale 3D-language datasets by leveraging the strengths of foundational models. The framework integrates key components, including multi-modal embedding, cross-modal interaction, and a language model decoder, to process natural language instructions and 3D scene data. This approach facilitates enhanced reasoning and response generation in complex 3D environments. Using the ScanNet 3D scene dataset, along with text annotations from ScanQA and ScanRefer, 3D-MoRe generates 62,000 question-answer (QA) pairs and 73,000 object descriptions across 1,513 scenes. We also employ various data augmentation techniques and implement semantic filtering to ensure high-quality data. Experiments on ScanQA demonstrate that 3D-MoRe significantly outperforms state-of-the-art baselines, with the CIDEr score improving by 2.15%. Similarly, on ScanRefer, our approach achieves a notable increase in CIDEr@0.5 by 1.84%, highlighting its effectiveness in both tasks. Our code and generated datasets will be publicly released to benefit the community, and both can be accessed on the https://3D-MoRe.github.io.
Problem

Research questions and friction points this paper is trying to address.

Generates large-scale 3D-language datasets for indoor tasks
Enhances reasoning in 3D environments via multimodal integration
Improves QA and dense captioning performance on benchmarks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages foundational models for 3D-language datasets
Integrates multi-modal embedding and cross-modal interaction
Employs data augmentation and semantic filtering
🔎 Similar Papers
No similar papers found.
Rongtao Xu
Rongtao Xu
MBZUAI << CASIA << HUST
Intelligent RobotEmbodied AIVLAVLMSpatialtemporal AI
H
Han Gao
Beijing University of Posts and Telecommunications, China
M
Mingming Yu
Institute of Automation, Chinese Academy of Sciences, China
D
Dong An
Institute of Automation, Chinese Academy of Sciences, China
S
Shunpeng Chen
Beijing University of Posts and Telecommunications, China
Changwei Wang
Changwei Wang
Shandong Computer Science Center
Multimodal LearningEmbodied AIEdge Intelligent ComputingAI for HealthcareSafety Alignment
L
Li Guo
Beijing University of Posts and Telecommunications, China
Xiaodan Liang
Xiaodan Liang
Professor of Computer Science, Sun Yat-sen University, MBZUAI, CMU, NUS
Computer visionEmbodied AIMachine learning
Shibiao Xu
Shibiao Xu
Beijing University of Posts and Telecommunications
Computer VisionMachine LearningComputer Graphics