Extending Embodied Question Answering from Perception to Decision

📅 2026-05-25
🤖 AI Summary
This work addresses the fragmentation of existing embodied question answering (EQA) datasets, each of which covers only a subset of reasoning skills, which makes it hard to evaluate an agent's perception, reasoning, and decision-making together. To this end, we introduce EQA-Decision, a large-scale EQA dataset and benchmark that covers four complementary dimensions: static scene construction, spatial understanding, task dynamics reasoning, and instant decision. It provides over four million hierarchically annotated question-answer pairs across diverse embodied scenarios. Building on it, we develop RoboDecision, a baseline model aligned with the benchmark that jointly evaluates perception, reasoning, and action-level decision-making. The reported results indicate that EQA-Decision both benchmarks and improves the spatial and interaction reasoning of vision-language models (VLMs), offering a unified evaluation framework for embodied intelligence research.
📝 Abstract
Embodied Question Answering (EQA) connects perception, reasoning, and interaction within embodied environments. However, existing datasets and benchmarks remain fragmented, each focusing on a limited subset of reasoning skills such as spatial understanding or procedural reasoning, without offering a unified large-scale framework for comprehensive evaluation. We present EQA-Decision, a large-scale embodied QA dataset that systematically covers four complementary dimensions of embodied reasoning: static scene construction, spatial understanding, task dynamics reasoning, and instant decision. The dataset contains over four million question-answer pairs with hierarchical annotations across diverse embodied scenarios. In addition, we develop RoboDecision, a strong baseline model aligned with the EQA-Decision Benchmark, providing a unified framework that jointly evaluates perception, reasoning, and action-level decision-making in embodied environments. Results demonstrate that EQA-Decision effectively benchmarks and enhances VLM capabilities in spatial and interaction reasoning, providing a solid foundation for advancing embodied intelligence research.
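To make the idea of a unified evaluation across the four dimensions concrete, here is a minimal sketch of per-dimension accuracy scoring. Only the four dimension names come from the abstract; the `QAPair` record, its fields, the `accuracy_by_dimension` helper, and the exact-match scoring rule are hypothetical illustrations, not the dataset's actual schema or the paper's metric.

```python
from collections import defaultdict
from dataclasses import dataclass

# Dimension names are taken from the abstract; everything else is assumed.
DIMENSIONS = (
    "static scene construction",
    "spatial understanding",
    "task dynamics reasoning",
    "instant decision",
)


@dataclass
class QAPair:
    """Hypothetical record for one question-answer pair."""
    dimension: str
    question: str
    answer: str


def accuracy_by_dimension(pairs, predictions):
    """Return exact-match accuracy for each dimension present in `pairs`.

    Matching is case-insensitive and ignores surrounding whitespace.
    This scoring rule is an assumption made for illustration.
    """
    if len(pairs) != len(predictions):
        raise ValueError("pairs and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for pair, pred in zip(pairs, predictions):
        if pair.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {pair.dimension!r}")
        total[pair.dimension] += 1
        if pred.strip().lower() == pair.answer.strip().lower():
            correct[pair.dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}
```

Reporting one score per dimension, rather than a single pooled number, keeps a weakness in one skill (for example, instant decision) from being hidden by strength in another.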
Problem

Research questions and friction points this paper is trying to address.

Embodied Question Answering
reasoning
decision-making
benchmark
embodied intelligence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Embodied Question Answering
EQA-Decision
RoboDecision
embodied reasoning
vision-language models