🤖 AI Summary
Existing embodied question answering (EQA) methods rely on vision-language models (VLMs) for direct interaction, lacking explicit reasoning and planning—leading to inefficient exploration and inaccurate answers. This paper introduces ToolEQA, the first framework to deeply integrate external tool invocation with multi-step reasoning, establishing a dynamic, embodied reasoning paradigm. To support scalable training and generalization, the authors design an automated trajectory–question–answer generation pipeline, yielding EQA-RT, a large-scale benchmark comprising about 18K diverse tasks. On the EQA-RT test sets, ToolEQA achieves a 9.2–20.2% absolute improvement in success rate over state-of-the-art baselines, and the fine-tuned model surpasses its zero-shot counterpart by 10%. It also attains state-of-the-art results on HM-EQA, OpenEQA, and EXPRESS-Bench. Key contributions include: (1) a tool-augmented multi-step reasoning architecture; (2) a scalable, automated paradigm for embodied QA data construction; and (3) a robust agent design explicitly optimized for generalization in realistic 3D environments.
📝 Abstract
Embodied Question Answering (EQA) requires agents to explore 3D environments to obtain observations and answer questions about the scene. Existing methods leverage VLMs to directly explore the environment and answer questions without explicit thinking or planning, which limits their reasoning ability and results in excessive or inefficient exploration as well as ineffective responses. In this paper, we introduce ToolEQA, an agent that integrates external tools with multi-step reasoning, where external tools provide additional useful information for completing the task, helping the model derive a better exploration direction in the next reasoning step and thus gather more effective information. This enables ToolEQA to generate more accurate responses with a shorter exploration distance. To enhance the model's ability for tool usage and multi-step reasoning, we further design a novel EQA data generation pipeline that automatically constructs large-scale EQA tasks with reasoning trajectories and corresponding answers. Based on this pipeline, we collect the EQA-RT dataset, which contains about 18K tasks, divided into a training set EQA-RT-Train and two test sets, EQA-RT-Seen (scenes overlapping with the training set) and EQA-RT-Unseen (novel scenes). Experiments on EQA-RT-Seen and EQA-RT-Unseen show that ToolEQA improves the success rate by 9.2–20.2% over state-of-the-art baselines, while the fine-tuned ToolEQA outperforms its zero-shot variant by 10% in success rate. In addition, ToolEQA achieves state-of-the-art performance on the HM-EQA, OpenEQA, and EXPRESS-Bench datasets, demonstrating its generality. Our homepage is available at https://tooleqa.github.io.
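The tool-augmented multi-step reasoning loop described above can be sketched in miniature. This is a hedged illustration only: the class and tool names (`Tool`, `EQAAgent`, `detect_objects`) and the naive stopping criterion are hypothetical stand-ins, not ToolEQA's actual API or policy, which in the paper is driven by a trained model rather than string matching.

```python
# Hypothetical sketch of a tool-augmented multi-step reasoning loop for EQA.
# At each step the agent invokes tools on the current observation, appends the
# returned information to its reasoning trajectory, and either answers or
# continues exploring. All names here are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Tool:
    name: str
    run: Callable[[str], str]  # consumes an observation, returns extracted info


@dataclass
class EQAAgent:
    tools: dict
    max_steps: int = 5
    trajectory: list = field(default_factory=list)

    def step(self, question: str, observation: str) -> Optional[str]:
        """One reasoning step: call each tool, record its output, decide to stop."""
        for tool in self.tools.values():
            info = tool.run(observation)
            self.trajectory.append(f"{tool.name}: {info}")
            # Naive stopping criterion for illustration; ToolEQA uses a
            # learned model to judge whether enough evidence was gathered.
            if question.lower() in info.lower():
                return info
        return None  # evidence insufficient: keep exploring

    def answer(self, question: str, observations: list) -> str:
        for obs in observations[: self.max_steps]:
            result = self.step(question, obs)
            if result is not None:
                return result
        return "unknown"


# Toy usage: a fake object-detection tool over scripted observations.
detector = Tool("detect_objects", lambda obs: f"detected: {obs}")
agent = EQAAgent(tools={"detect_objects": detector})
print(agent.answer("sofa", ["table", "sofa near window"]))
# → detected: sofa near window
```

The key design point mirrored here is that each tool call feeds information back into the trajectory before the next exploration decision, so later steps condition on everything gathered so far rather than on raw observations alone.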