PRISM: Perception Reasoning Interleaved for Sequential Decision Making

📅 2026-05-06
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the perception-reasoning disconnect in large language model (LLM)-driven embodied agents operating in multimodal environments, which often causes them to overlook task-critical information. To bridge this gap, the authors propose PRISM, a framework built on a Dynamic Question Answering (DQA) pipeline that establishes a closed loop between a Vision-Language Model (VLM) and an LLM: the LLM actively guides the VLM toward concise, task-oriented image descriptions. The pipeline is fully automated and requires no manually designed questions or answers, so perception stays goal-directed and tightly integrated with decision-making. Evaluated on the ALFWorld and Room-to-Room benchmarks, the method significantly outperforms existing image-based models and shows consistent, systematic performance gains.
📝 Abstract
Scaling LLM-based embodied agents from text-only environments to complex multimodal settings remains a major challenge. Recent work identifies a perception-reasoning-decision gap in standalone Vision-Language Models (VLMs), which often overlook task-critical information. In this paper, we introduce PRISM, a framework that tightly couples perception (VLM) and decision (LLM) through a dynamic question-answer (DQA) pipeline. Instead of passively accepting the VLM's description, the LLM critiques it, probes the VLM with goal-oriented questions, and synthesizes a compact image description. This closed-loop interaction yields a sharp, task-driven understanding of the scene. We evaluate PRISM on the ALFWorld and Room-to-Room (R2R) benchmarks. We show that: (1) PRISM significantly outperforms state-of-the-art image-based models, (2) our interactive goal-oriented perception pipeline yields systematic and substantial gains, and (3) PRISM is fully automatic, eliminating the need for handcrafted questions or answers.
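The closed loop described in the abstract (initial description, LLM critique and goal-oriented questions, synthesis of a compact description) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `dqa_describe`, the prompt wording, the `DONE` stop signal, the `max_rounds` limit, and the callable interfaces for the VLM and LLM are all assumptions made for illustration. The two stub models exist only so the sketch runs without any real model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical interfaces; the source does not specify how the models are invoked.
VLM = Callable[[str, str], str]  # (image_id, prompt) -> text answer
LLM = Callable[[str], str]       # prompt -> text completion


@dataclass
class DQAResult:
    description: str  # compact, task-oriented description
    transcript: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer) pairs


def _format_qa(transcript: List[Tuple[str, str]]) -> str:
    return "\n".join(f"Q: {q}\nA: {a}" for q, a in transcript)


def dqa_describe(image_id: str, goal: str, vlm: VLM, llm: LLM,
                 max_rounds: int = 3) -> DQAResult:
    """Assumed DQA-style loop: caption, then question/answer rounds, then synthesis."""
    # Step 1: the VLM gives a passive first description.
    draft = vlm(image_id, "Describe the image in one or two sentences.")
    transcript: List[Tuple[str, str]] = []

    # Step 2: the LLM critiques the description by asking goal-oriented questions.
    for _ in range(max_rounds):
        question = llm(
            f"Goal: {goal}\nDescription: {draft}\n{_format_qa(transcript)}\n"
            "Ask one question about missing task-critical details, or reply DONE."
        ).strip()
        if question == "DONE":  # assumed stop signal
            break
        transcript.append((question, vlm(image_id, question)))

    # Step 3: the LLM synthesizes a compact description from everything gathered.
    summary = llm(
        f"Goal: {goal}\nDescription: {draft}\n{_format_qa(transcript)}\n"
        "Write a compact description containing only task-relevant facts."
    ).strip()
    return DQAResult(summary, transcript)


# Deterministic stand-ins for real models, for demonstration only.
def stub_vlm(image_id: str, prompt: str) -> str:
    if "mug" in prompt.lower():
        return "A mug is on the countertop."
    return "A kitchen with a countertop."


def stub_llm(prompt: str) -> str:
    if "Write a compact" in prompt:
        return "Mug on the countertop."
    if "\nA: " in prompt:  # at least one answer already collected
        return "DONE"
    return "Is there a mug in the image?"


if __name__ == "__main__":
    result = dqa_describe("img0", "find a mug", stub_vlm, stub_llm)
    print(result.description)
    print(result.transcript)
```

In this sketch the critique and the questioning are folded into a single LLM prompt per round. A real system might separate them, and would pass actual image data rather than an identifier string.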
Problem

Research questions and friction points this paper is trying to address.

perception-reasoning-decision gap
Vision-Language Models
multimodal embodied agents
sequential decision making
task-critical information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perception-Reasoning Interleaving
Dynamic Question-Answering
Embodied Agents
Vision-Language Models
Task-Driven Perception