PRISM: Planning and Reasoning with Intent in Simulated Embodied Environments

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Current evaluations of embodied agents report only aggregate success rates, which makes it difficult to pinpoint the root causes of failure. This work proposes PRISM, a diagnostic benchmark that hierarchically assesses agent capabilities in perception-to-action grounding, implicit intent reasoning, and long-horizon planning across 300 human-validated household tasks set in five photorealistic multi-room apartments. Optional, modular probes for perception, memory, and planning allow failures to be attributed to individual components. An agent-agnostic action API supports unified evaluation of diverse systems, including LLMs, VLMs, and symbolic planners. Experiments show that implicit intent resolution is a significant bottleneck across model families, and that long-horizon tasks expose a capability cliff: lightweight models fall to as low as 20.0% success while consuming more tokens than frontier models, a pattern the authors describe as compensatory over-reasoning.

📝 Abstract

When an LLM-based embodied agent fails at a household task, the culprit could be misidentified objects, forgotten sub-goals, or poor action sequencing, yet existing benchmarks report only a single success rate, making it impossible to tell which cognitive module is responsible. We present PRISM, a diagnostic benchmark that reframes this problem: rather than asking only "did the agent succeed?", PRISM asks "which capability is most likely responsible for failure?" Built on five photorealistic multi-room apartments (4–8 rooms each), PRISM structures 300 human-verified tasks into three capability tiers (Basic Ability, Reasoning Ability, and Long-horizon Ability) that isolate perception-to-action grounding, implicit intent resolution, and sustained multi-step coordination, respectively. PRISM exposes an agent-agnostic executable action API that allows arbitrary agents (LLM agents, VLM agents, symbolic planners, RL policies, and hybrid systems) to be evaluated end-to-end under the same benchmark protocol. To support deeper diagnosis, optional probes for perception, memory, and planning can be adopted, replaced, or bypassed entirely, enabling controlled component-level analysis when desired. Experiments on seven contemporary LLMs establish a clear hierarchy: explicit spatial grounding is not the dominant failure source under oracle perception; implicit intent resolution is a significant bottleneck for all model families; and long-horizon coordination exposes a stark capability cliff, where lightweight models collapse to as low as 20.0% success while consuming more tokens than their frontier counterparts, a signature of compensatory over-reasoning rather than genuine planning capability. Project page: https://sj-li.com/PROJ/PRISM
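The contrast between a single success rate and a tiered breakdown can be illustrated with a minimal Python sketch. Only the three tier names come from the abstract; the `EpisodeResult` record and the functions `success_by_tier` and `aggregate_success` are hypothetical names invented for this illustration and are not part of the PRISM API or its released code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Tier names are taken from the abstract; everything else is hypothetical.
TIERS = ("Basic Ability", "Reasoning Ability", "Long-horizon Ability")


@dataclass(frozen=True)
class EpisodeResult:
    """Outcome of one evaluated task (hypothetical record layout)."""
    task_id: str
    tier: str
    success: bool


def aggregate_success(results: Iterable[EpisodeResult]) -> float:
    """Single success rate over all tasks, as conventional benchmarks report."""
    results = list(results)
    if not results:
        raise ValueError("no results to aggregate")
    return sum(r.success for r in results) / len(results)


def success_by_tier(results: Iterable[EpisodeResult]) -> Dict[str, float]:
    """Success rate per capability tier; tiers with no episodes are omitted."""
    totals: Dict[str, int] = defaultdict(int)
    wins: Dict[str, int] = defaultdict(int)
    for r in results:
        if r.tier not in TIERS:
            raise ValueError(f"unknown tier: {r.tier!r}")
        totals[r.tier] += 1
        wins[r.tier] += int(r.success)
    return {t: wins[t] / totals[t] for t in TIERS if totals[t]}
```

Under these assumptions, two agents with the same aggregate score can have very different per-tier profiles, which is the kind of distinction a tiered benchmark is meant to surface.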

Problem

Research questions and friction points this paper is trying to address.

embodied AI

diagnostic benchmark

cognitive failure attribution

intent reasoning

long-horizon planning

Innovation

Methods, ideas, or system contributions that make the work stand out.

diagnostic benchmark

embodied AI

intent reasoning

long-horizon planning

agent-agnostic evaluation

🔎 Similar Papers

Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI

2024-07-09 · IEEE/ASME Transactions on Mechatronics · Citations: 94