🤖 AI Summary
This work addresses the perception-reasoning disconnect in large language model (LLM)-driven embodied agents operating in multimodal environments, which often causes them to overlook task-critical information. To bridge this gap, the authors propose PRISM, a framework built on a Dynamic Question Answering (DQA) pipeline that sets up a closed loop between a Vision-Language Model (VLM) and an LLM: the LLM critiques the VLM's description, asks it goal-oriented questions, and condenses the result into a concise, task-oriented image description. The loop is fully automated and requires no manually designed questions or answers, tightly integrating perception and decision-making. Evaluated on the ALFWorld and Room-to-Room (R2R) benchmarks, the method is reported to significantly outperform existing image-based models, with systematic gains attributed to the interactive perception pipeline.
📝 Abstract
Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception-reasoning-decision gap in standalone Vision-Language Models (VLMs), which often overlook task-critical information. In this paper, we introduce PRISM, a framework that tightly couples perception (VLM) and decision (LLM) through a dynamic question-answer (DQA) pipeline. Instead of passively accepting the VLM's description, the LLM critiques it, probes the VLM with goal-oriented questions, and synthesizes a compact image description. This closed-loop interaction yields a sharp, task-driven understanding of the scene. We evaluate PRISM on the ALFWorld and Room-to-Room (R2R) benchmarks. We show that: (1) PRISM significantly outperforms state-of-the-art image-based models, (2) our interactive goal-oriented perception pipeline yields systematic and substantial gains, and (3) PRISM is fully automatic, eliminating the need for handcrafted questions or answers.
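The critique, probe, and synthesize loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function name `dqa_describe`, the `vlm` and `llm` callable signatures, the prompt wording, the `DONE` stop token, and the `max_questions` cap are all assumptions introduced here for illustration.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (not from the paper):
#   vlm(image, prompt) -> text answer about the image
#   llm(prompt)        -> text completion
VLM = Callable[[str, str], str]
LLM = Callable[[str], str]


def dqa_describe(image: str, goal: str, vlm: VLM, llm: LLM,
                 max_questions: int = 3) -> str:
    """Return a compact, goal-oriented description of `image`.

    Assumed loop: the LLM reads the VLM's draft description, asks
    follow-up questions until it replies DONE (or the cap is hit),
    then condenses everything into a short description.
    """
    # Step 1: passive description from the VLM.
    draft = vlm(image, "Describe the scene.")
    qa_pairs: List[Tuple[str, str]] = []

    # Step 2: the LLM critiques the draft and probes the VLM.
    for _ in range(max_questions):
        history = "\n".join(f"Q: {q} A: {a}" for q, a in qa_pairs)
        question = llm(
            "CRITIQUE\n"
            f"Goal: {goal}\nDescription: {draft}\n{history}\n"
            "Ask one question about missing task-critical details, "
            "or reply DONE."
        ).strip()
        if question == "DONE":
            break
        qa_pairs.append((question, vlm(image, question)))

    # Step 3: the LLM synthesizes a compact description for the agent.
    history = "\n".join(f"Q: {q} A: {a}" for q, a in qa_pairs)
    return llm(
        "SUMMARIZE\n"
        f"Goal: {goal}\nDescription: {draft}\n{history}\n"
        "Write a short description keeping only task-relevant facts."
    ).strip()
```

In this sketch the questions are generated by the LLM at run time, which mirrors the abstract's claim that no handcrafted questions or answers are needed. The question cap is a guard added here so the loop always terminates; the paper may use a different stopping rule.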