Embodied Intelligence for 3D Understanding: A Survey on 3D Scene Question Answering

📅 2025-02-01
🤖 AI Summary
Problem: 3D Scene Question Answering (3D SQA) faces core challenges including dataset heterogeneity, inefficient multimodal fusion, and inconsistent task formulation. Method: This paper presents the first systematic survey of 3D SQA, establishing a standardized analytical framework covering datasets, methodologies, and evaluation protocols. It introduces the first taxonomy of 3D SQA methods, spanning diverse 3D representations (point clouds, voxels, and NeRF) and integrating 3D visual understanding, natural language processing, and large language model techniques such as instruction tuning and zero-shot transfer. Contribution/Results: The survey identifies three critical bottlenecks: insufficient data standardization, weak cross-modal alignment, and the absence of embodied tasks. It proposes three corresponding future directions: unified cross-benchmark datasets, explicit cross-modal alignment mechanisms, and embodied-task extensions. This work provides both theoretical foundations and practical paradigms for semantic understanding and interactive reasoning in 3D environments.

📝 Abstract
3D Scene Question Answering (3D SQA) represents an interdisciplinary task that integrates 3D visual perception and natural language processing, empowering intelligent agents to comprehend and interact with complex 3D environments. Recent advances in large multimodal modelling have driven the creation of diverse datasets and spurred the development of instruction-tuning and zero-shot methods for 3D SQA. However, this rapid progress introduces challenges, particularly in achieving unified analysis and comparison across datasets and baselines. This paper presents the first comprehensive survey of 3D SQA, systematically reviewing datasets, methodologies, and evaluation metrics while highlighting critical challenges and future opportunities in dataset standardization, multimodal fusion, and task design.
Problem

Research questions and friction points this paper is trying to address.

3D Scene Question Answering
Fair Comparison
Data Standardization
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Scene Question Answering
Multimodal Information Fusion
Dataset Standardization
Zechuan Li
Hunan University
Point cloud, Deep Learning, 3D Object Detection
Hongshan Yu
Hunan University
Yihao Ding
The University of Western Australia
Multimodal Learning, Document Understanding, Interdisciplinary AI
Yan Li
The University of Sydney
Yong He
Anhui University
Naveed Akhtar
The University of Melbourne