GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering

📅 2024-12-19
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
📄 PDF
🤖 AI Summary
Embodied Question Answering (EQA) faces challenges including weak semantic representations, difficulty in updating memory online, and insufficient use of prior world knowledge. To address these, the paper proposes GraphEQA, an EQA framework built on a real-time, incrementally constructed metric-semantic 3D scene graph (3DSG). The 3DSG, together with task-relevant images, serves as a multimodal memory for grounding vision-language models (VLMs). A semantic-guided hierarchical planning mechanism exploits the intrinsic hierarchy of the 3DSG, enabling structured planning and efficient exploration in unseen environments. The method is deployed and evaluated both in simulation on the HM-EQA dataset and in real-world home and office environments, achieving higher task success rates and fewer planning steps than state-of-the-art baselines. The core contribution is the first online 3DSG-based memory paradigm tailored for EQA, establishing a closed-loop, semantics-driven hierarchical embodied decision-making architecture.
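To make the 3DSG-as-multimodal-memory idea concrete, here is a minimal sketch of a hierarchical scene graph that is updated incrementally and serialized into text for grounding a VLM. All class, field, and method names are illustrative assumptions, not the paper's actual data structures.

```python
# Hypothetical sketch of a hierarchical metric-semantic 3D scene graph (3DSG)
# used as multimodal memory: rooms and objects form layers, and task-relevant
# image IDs are attached to object nodes. Names are assumptions for
# illustration only.
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    label: str                                  # semantic label, e.g. "kitchen"
    position: tuple                             # metric position (x, y, z)
    layer: str                                  # "building" | "room" | "object"
    children: list = field(default_factory=list)
    images: list = field(default_factory=list)  # task-relevant image IDs


class SceneGraph3D:
    def __init__(self):
        self.root = SceneNode("building", (0.0, 0.0, 0.0), "building")

    def add_room(self, label, position):
        # Incremental update: new observations extend the graph online.
        room = SceneNode(label, position, "room")
        self.root.children.append(room)
        return room

    def add_object(self, room, label, position, image_id=None):
        obj = SceneNode(label, position, "object")
        if image_id is not None:
            obj.images.append(image_id)
        room.children.append(obj)
        return obj

    def to_prompt(self):
        # Serialize the hierarchy into text a VLM can be grounded on,
        # paired with the referenced images as the visual modality.
        lines = []
        for room in self.root.children:
            lines.append(f"room: {room.label}")
            for obj in room.children:
                lines.append(f"  object: {obj.label} (images: {obj.images})")
        return "\n".join(lines)
```

In this sketch the text serialization and the attached images together would form the multimodal context passed to the VLM at each planning step.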

📝 Abstract
In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. This remains a challenging problem in robotics, due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient exploration and planning. Aiming to address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantic-guided exploration. Through experiments in simulation on the HM-EQA dataset and in the real world in home and office environments, we demonstrate that our method outperforms key baselines by completing EQA tasks with higher success rates and fewer planning steps.
Problem

Research questions and friction points this paper is trying to address.

Developing semantic understanding of unseen environments for embodied question answering
Addressing challenges in obtaining and updating semantic representations online
Leveraging prior knowledge for efficient robotic planning and exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time 3D semantic scene graphs for environment representation
Hierarchical planning exploiting graph structure for efficient exploration
Multi-modal memory with images to ground vision-language models
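The hierarchical planning idea above can be sketched as a coarse-to-fine selection over the scene graph: first choose the most question-relevant room, then a target object (or a frontier to explore) within it. The toy word-overlap score below stands in for the VLM's semantic reasoning; all names and the scoring rule are assumptions, not the paper's actual planner.

```python
# Illustrative coarse-to-fine planner over a flattened 3DSG. The relevance
# function is a toy stand-in for VLM-based semantic scoring.

def relevance(question_words, label):
    # Toy semantic score: word overlap between the question and the label.
    return len(question_words & set(label.lower().split("_")))


def hierarchical_plan(scene, question):
    """scene: {room_label: [object_labels]}; returns (room, target)."""
    words = set(question.lower().replace("?", "").split())
    # Level 1: pick the most question-relevant room (structured planning).
    room = max(scene, key=lambda r: relevance(words, r))
    # Level 2: pick the most relevant object, or fall back to exploring
    # a frontier when nothing in the room matches (semantic-guided exploration).
    objs = scene[room]
    target = max(objs, key=lambda o: relevance(words, o)) if objs else "frontier"
    return room, target


scene = {"kitchen": ["coffee_mug", "stove"], "bedroom": ["bed"]}
print(hierarchical_plan(scene, "Is there a mug on the stove in the kitchen?"))
# → ('kitchen', 'coffee_mug')
```

Planning top-down like this narrows the search space at each level, which is one plausible reading of why exploiting the 3DSG hierarchy reduces planning steps.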