EfficientEQA: An Efficient Approach for Open Vocabulary Embodied Question Answering

📅 2024-10-26
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
Existing embodied question answering (EQA) methods either rely on static video understanding, lacking active exploration, or operate within closed answer sets, failing to meet domestic robots' dual requirements of open-vocabulary response generation and efficient navigation. This paper proposes EfficientEQA, an end-to-end framework for open-vocabulary EQA. It introduces a semantic-value-weighted frontier exploration strategy for goal-directed active navigation; a dynamic stopping mechanism that flags highly relevant observations as outliers to avoid redundant exploration; and an answer generation pipeline integrating retrieval-augmented generation (RAG), BLIP-based image retrieval, and confidence calibration of black-box vision-language models (VLMs). On EQA benchmarks, the method improves question-answering accuracy by over 15% and reduces exploration steps by over 20% compared to state-of-the-art approaches.

📝 Abstract
Embodied Question Answering (EQA) is an essential yet challenging task for robotic home assistants. Recent studies have shown that large vision-language models (VLMs) can be effectively utilized for EQA, but existing works either focus on video-based question answering without embodied exploration or rely on closed-form choice sets. In real-world scenarios, a robotic agent must efficiently explore and accurately answer questions in open-vocabulary settings. To address these challenges, we propose a novel framework called EfficientEQA for open-vocabulary EQA, which enables efficient exploration and accurate answering. In EfficientEQA, the robot actively explores unknown environments using Semantic-Value-Weighted Frontier Exploration, a strategy that prioritizes exploration based on semantic importance provided by calibrated confidence from black-box VLMs to quickly gather relevant information. To generate accurate answers, we employ Retrieval-Augmented Generation (RAG), which utilizes BLIP to retrieve useful images from accumulated observations and VLM reasoning to produce responses without relying on predefined answer choices. Additionally, we detect observations that are highly relevant to the question as outliers, allowing the robot to determine when it has sufficient information to stop exploring and provide an answer. Experimental results demonstrate the effectiveness of our approach, showing an improvement in answering accuracy by over 15% and efficiency, measured in running steps, by over 20% compared to state-of-the-art methods.
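The abstract's Semantic-Value-Weighted Frontier Exploration ranks candidate frontiers by semantic importance derived from calibrated VLM confidence. A minimal sketch of that idea, not the paper's implementation: the function names (`frontier_utility`, `pick_frontier`), the linear distance discount `lam`, and the assumption that semantic values arrive as calibrated scores in [0, 1] are all illustrative choices.

```python
import math

def frontier_utility(frontier, agent_pos, semantic_value, lam=0.1):
    """Score one frontier: a hypothetical calibrated VLM confidence
    (semantic_value in [0, 1]) discounted by travel distance."""
    dist = math.dist(agent_pos, frontier)
    return semantic_value - lam * dist

def pick_frontier(frontiers, agent_pos, semantic_values, lam=0.1):
    """Choose the frontier with the highest semantic-value-weighted utility."""
    return max(
        zip(frontiers, semantic_values),
        key=lambda fv: frontier_utility(fv[0], agent_pos, fv[1], lam),
    )[0]
```

With this weighting, a distant frontier can still win if the VLM assigns it a high enough semantic value, which is what lets exploration head toward question-relevant regions rather than simply the nearest unexplored space.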
Problem

Research questions and friction points this paper is trying to address.

Video-based QA methods lack active embodied exploration
Closed-form choice sets cannot support open-vocabulary answers
Robot assistants need both accurate answering and efficient, short exploration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantic-Value-Weighted Frontier Exploration prioritizes key areas
Outlier detection on observation relevance stops exploration adaptively
Retrieval-Augmented Generation enhances answer accuracy
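The adaptive stopping idea above, detecting question-relevant observations as outliers, can be sketched as a simple statistical test. This is an illustrative stand-in, not the paper's method: `should_stop`, the z-score threshold, and treating BLIP image-question similarities as the relevance scores are all assumptions.

```python
import statistics

def should_stop(relevance_scores, z_thresh=3.0):
    """Stop exploring when the newest observation's relevance to the
    question is an outlier relative to the history of observations.
    relevance_scores: stand-in for BLIP image-question similarities,
    oldest first, newest last."""
    if len(relevance_scores) < 3:
        return False  # too little history to judge an outlier
    *history, latest = relevance_scores
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)
    if sigma == 0:
        return latest > mu  # any jump over a flat history counts
    return (latest - mu) / sigma > z_thresh
```

Once the test fires, the accumulated observations would be handed to the RAG pipeline: retrieve the most relevant images (e.g., by the same similarity scores) and let the VLM generate a free-form answer from them.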