FAST-EQA: Efficient Embodied Question Answering with Global and Local Region Relevancy

📅 2026-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses efficient reasoning in partially observable environments for embodied question answering. It proposes a question-conditioned framework in which question-guided target identification and global region-of-interest scoring drive navigation, and chain-of-thought reasoning over a bounded scene memory produces the answer. Its main contributions are the bounded scene memory and a combined global-local exploration strategy built on high-value frontiers. The unified architecture handles both single- and multi-target questions and runs substantially faster than prior approaches while improving answer reliability. The method achieves state-of-the-art performance on HMEQA and EXPRESS-Bench and competitive results on OpenEQA and MT-HM3D.
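As a rough illustration of frontier selection that mixes global region relevancy with a preference for openings, the sketch below ranks candidate frontiers by a hand-written value function. Everything in it is an assumption made for illustration: the `Frontier` fields, the `opening_bonus` and `distance_weight` parameters, and the additive scoring form are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Frontier:
    """Hypothetical frontier record; field names are illustrative only."""
    name: str
    relevance: float    # assumed question-conditioned region score in [0, 1]
    is_opening: bool    # True for a door or narrow opening
    distance_m: float   # assumed travel distance from the agent, in meters


def frontier_value(f: Frontier,
                   opening_bonus: float = 0.5,
                   distance_weight: float = 0.05) -> float:
    """Assumed additive score: relevance, plus a bonus for openings, minus travel cost."""
    bonus = opening_bonus if f.is_opening else 0.0
    return f.relevance + bonus - distance_weight * f.distance_m


def pick_frontier(frontiers: Sequence[Frontier]) -> Optional[Frontier]:
    """Return the highest-value frontier, or None when there is nothing left to explore."""
    if not frontiers:
        return None
    return max(frontiers, key=frontier_value)


candidates = [
    Frontier("wall gap", relevance=0.6, is_opening=False, distance_m=2.0),
    Frontier("doorway", relevance=0.4, is_opening=True, distance_m=3.0),
]
chosen = pick_frontier(candidates)
```

With these made-up weights the doorway outranks the nearer, more relevant wall gap, which is the qualitative behavior one would expect from a policy that favors openings; the actual scoring used by FAST-EQA may differ.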

📝 Abstract
Embodied Question Answering (EQA) combines visual scene understanding, goal-directed exploration, and spatial and temporal reasoning under partial observability. A central challenge is to confine physical search to question-relevant subspaces while maintaining a compact, actionable memory of observations. Furthermore, for real-world deployment, fast inference during exploration is crucial. We introduce FAST-EQA, a question-conditioned framework that (i) identifies likely visual targets, (ii) scores global regions of interest to guide navigation, and (iii) employs Chain-of-Thought (CoT) reasoning over visual memory to answer confidently. FAST-EQA maintains a bounded scene memory that stores a fixed-capacity set of region-target hypotheses and updates them online, enabling robust handling of both single- and multi-target questions without unbounded growth. To expand coverage efficiently, a global exploration policy treats narrow openings and doors as high-value frontiers, complementing local target seeking with minimal computation. Together, these components focus the agent's attention and improve both scene coverage and answer reliability while running substantially faster than prior approaches. On HMEQA and EXPRESS-Bench, FAST-EQA achieves state-of-the-art performance, and it performs competitively on OpenEQA and MT-HM3D.
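The idea of a fixed-capacity set of region-target hypotheses that is updated online can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the class name `BoundedSceneMemory`, the scalar confidence score, the keep-the-best update rule, and the evict-the-weakest policy are all hypothetical choices.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (region, target) pair; an assumed hypothesis identifier


class BoundedSceneMemory:
    """Hypothetical fixed-capacity store of region-target hypotheses."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._scores: Dict[Key, float] = {}

    def __len__(self) -> int:
        return len(self._scores)

    def update(self, region: str, target: str, score: float) -> None:
        """Record an observation online, keeping the best score per hypothesis."""
        key = (region, target)
        if score > self._scores.get(key, float("-inf")):
            self._scores[key] = score
        # Assumed eviction policy: drop the weakest hypothesis once over capacity.
        if len(self._scores) > self.capacity:
            weakest = min(self._scores, key=self._scores.get)
            del self._scores[weakest]

    def top(self, k: Optional[int] = None) -> List[Tuple[Key, float]]:
        """Return hypotheses sorted by descending score (all of them if k is None)."""
        ranked = sorted(self._scores.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


memory = BoundedSceneMemory(capacity=2)
memory.update("kitchen", "mug", 0.4)
memory.update("hallway", "mug", 0.1)
memory.update("bedroom", "mug", 0.9)   # pushes out the weakest entry (hallway)
memory.update("kitchen", "mug", 0.2)   # lower score, so the stored 0.4 is kept
best = memory.top(1)
```

Because the store never holds more than `capacity` entries, the context handed to a downstream reasoning step stays bounded no matter how long the agent explores, which is the property the abstract emphasizes.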
Problem

Research questions and friction points this paper is trying to address.

Embodied Question Answering
partial observability
efficient exploration
scene memory
real-time inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embodied Question Answering
Chain-of-Thought Reasoning
Bounded Scene Memory
Global-Local Exploration
Efficient Navigation