Multi-Step Reasoning for Embodied Question Answering via Tool Augmentation

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing embodied question answering (EQA) methods rely on vision-language models (VLMs) for direct interaction, lacking explicit reasoning and planning—leading to inefficient exploration and inaccurate answers. This paper introduces ToolEQA, the first framework to deeply integrate external tool invocation with multi-step reasoning, establishing a dynamic, embodied reasoning paradigm. To support scalable training and generalization, we design an automated trajectory–question–answer generation pipeline, yielding EQA-RT, a large-scale benchmark comprising 18K diverse tasks. On the EQA-RT test set, ToolEQA achieves a 9.2–20.2% absolute improvement in success rate and surpasses baselines by 10% in zero-shot performance. It also attains state-of-the-art results on HM-EQA, OpenEQA, and EXPRESS-Bench. Key contributions include: (1) a tool-augmented multi-step reasoning architecture; (2) a scalable, automated paradigm for embodied QA data construction; and (3) a robust agent design explicitly optimized for generalization in realistic 3D environments.

📝 Abstract
Embodied Question Answering (EQA) requires agents to explore 3D environments to obtain observations and answer questions related to the scene. Existing methods leverage VLMs to directly explore the environment and answer questions without explicit thinking or planning, which limits their reasoning ability and results in excessive or inefficient exploration as well as ineffective responses. In this paper, we introduce ToolEQA, an agent that integrates external tools with multi-step reasoning, where external tools can provide more useful information for completing the task, helping the model derive better exploration directions in the next step of reasoning and thus obtain additional effective information. This enables ToolEQA to generate more accurate responses with a shorter exploration distance. To enhance the model's ability for tool usage and multi-step reasoning, we further design a novel EQA data generation pipeline that automatically constructs large-scale EQA tasks with reasoning trajectories and corresponding answers. Based on the pipeline, we collect the EQA-RT dataset, which contains about 18K tasks divided into a training set, EQA-RT-Train, and two test sets, EQA-RT-Seen (scenes overlapping with the training set) and EQA-RT-Unseen (novel scenes). Experiments on EQA-RT-Seen and EQA-RT-Unseen show that ToolEQA improves the success rate by 9.2–20.2% over state-of-the-art baselines, while outperforming the zero-shot ToolEQA by 10% in success rate. In addition, ToolEQA also achieves state-of-the-art performance on the HM-EQA, OpenEQA, and EXPRESS-Bench datasets, demonstrating its generality. Our homepage: https://tooleqa.github.io.
Problem

Research questions and friction points this paper is trying to address.

Existing EQA methods lack explicit reasoning and planning capabilities
Current approaches result in inefficient exploration and ineffective responses
Need to enhance embodied question answering with multi-step reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates external tools with multi-step reasoning
Automatically constructs large-scale EQA training data
Improves success rate by 9.2–20.2% over state-of-the-art baselines
Mingliang Zhai
Beijing Key Laboratory of Intelligent Information Technology, Beijing Institute of Technology, Beijing, China
Hansheng Liang
School of Mechanical Engineering, Beijing Institute of Technology, Beijing, China
Xiaomeng Fan
Beijing Institute of Technology
machine learning, computer vision
Zhi Gao
Beijing Key Laboratory of Intelligent Information Technology, Beijing Institute of Technology, Beijing, China
Chuanhao Li
Shanghai Artificial Intelligence Laboratory, Shanghai, China
Che Sun
Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University, Shenzhen, Guangdong, China
Xu Bin
School of Mechanical Engineering, Beijing Institute of Technology, Beijing, China
Yuwei Wu
Ph.D. candidate, GRASP Lab, University of Pennsylvania
Robotics, Trajectory Optimization, Task and Motion Planning
Yunde Jia
Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University, Shenzhen, Guangdong, China