S-EQA: Tackling Situational Queries in Embodied Question Answering

📅 2024-05-08
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses situational embodied question answering (S-EQA), a novel task requiring reasoning over the states of multiple objects in household settings (e.g., "Is the house ready for sleeptime?"), which extends beyond conventional embodied QA's focus on single-object attribute queries. The authors formally define S-EQA as a new benchmark task and propose the Prompt-Generate-Evaluate (PGE) generative framework. Combining semantic-similarity-based deduplication with MTurk crowdsourcing for ground-truth answers, they construct the first S-EQA benchmark, with 97.26% of the generated queries judged answerable. Empirical analysis reveals a gap: the LLM is good at *generating* situational queries and consensus object information, but answering the same queries directly succeeds only 46.2% of the time. To bridge this gap, the generated object-state consensus is used to frame simplified queries for VQA over VirtualHome images, improving accuracy by 15.31% over directly answering the situational queries.
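
The consensus-to-VQA simplification described above can be pictured roughly as follows. This is a minimal sketch, not the paper's implementation: the consensus dictionary format, the question template, and the ask_vqa() hook are illustrative assumptions.

```python
def ask_vqa(image, question: str) -> str:
    """Placeholder for any off-the-shelf VQA model; assumed to return 'yes' or 'no'."""
    raise NotImplementedError("Plug in a VQA model here.")

def answer_situational_query(image, consensus: dict[str, str]) -> bool:
    """Answer a situational query by checking each consensus object state separately."""
    # Example consensus for "Is the house ready for sleeptime?":
    # {"doors": "closed", "lights": "off"}
    for obj, desired_state in consensus.items():
        # Frame a simple object-level query instead of the full situational one.
        question = f"Are the {obj} {desired_state}?"
        if ask_vqa(image, question).strip().lower() != "yes":
            return False
    # The situational query holds only if every object is in its desired state.
    return True
```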

📝 Abstract
We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work tackling simple queries that directly reference target objects and properties ("What is the color of the car?"), situational queries (such as "Is the house ready for sleeptime?") are more challenging, requiring the agent to identify multiple objects (Doors: Closed, Lights: Off, etc.) and reach a consensus on their states for an answer. Towards this objective, we first introduce a novel Prompt-Generate-Evaluate (PGE) scheme that wraps around an LLM's output to create a dataset of unique situational queries and corresponding consensus object information. PGE maintains uniqueness among the generated queries using semantic similarity via a feedback loop. We annotate the generated data for ground-truth answers via a large-scale user study conducted on M-Turk, and with a high answerability rate of 97.26%, establish that LLMs are good at generating situational data. However, using the same LLM to answer the queries gives a low success rate of 46.2%, indicating that while LLMs are good at generating query data, they are poor at answering them. We use images from the VirtualHome simulator together with the S-EQA queries to establish an evaluation benchmark via Visual Question Answering (VQA). We report an improved accuracy of 15.31% when using queries framed from the generated object consensus for VQA over directly answering situational ones, indicating that such simplification is necessary for improved performance. To the best of our knowledge, this is the first work to introduce EQA in the context of situational queries that also uses a generative approach for query creation. We aim to foster research on improving the real-world usability of embodied agents in household environments through this work.
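
As a rough illustration of the uniqueness mechanism in the abstract, the sketch below re-samples the LLM whenever a candidate query is too semantically close to one already kept. The embedding model (all-MiniLM-L6-v2), the 0.85 similarity threshold, and the generate_situational_query() stub are assumptions, not details from the paper.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def generate_situational_query(prompt: str) -> str:
    """Placeholder for the LLM call that the PGE scheme wraps around."""
    raise NotImplementedError("Plug in an LLM client here.")

def build_unique_queries(prompt: str, target_count: int, sim_threshold: float = 0.85) -> list[str]:
    """Keep sampling queries, rejecting any candidate too similar to one already kept."""
    queries: list[str] = []
    embeddings = []
    while len(queries) < target_count:
        candidate = generate_situational_query(prompt)
        cand_emb = encoder.encode(candidate, convert_to_tensor=True)
        # Evaluate step: cosine similarity against all accepted queries.
        if embeddings and max(util.cos_sim(cand_emb, emb).item() for emb in embeddings) > sim_threshold:
            continue  # feedback loop: too similar, re-sample from the LLM
        queries.append(candidate)
        embeddings.append(cand_emb)
    return queries
```
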
Problem

Research questions and friction points this paper is trying to address.

Addressing Embodied Question Answering with Situational Queries in households.
Generating and evaluating situational queries using a Prompt-Generate-Evaluate scheme.
Exploring LLM limitations in answering situational queries accurately and with commonsense reasoning.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces Prompt-Generate-Evaluate (PGE) scheme
Generates situational queries using LLMs
Evaluates LLM answering performance via VQA on VirtualHome simulator images