🤖 AI Summary
Evaluations of embodied agents typically report only an aggregate success rate, which makes it hard to pinpoint why an agent failed. This work presents PRISM, a diagnostic benchmark that assesses agent capabilities across three tiers (basic perception-to-action grounding, reasoning about implicit intent, and long-horizon planning) using 300 human-verified household tasks set in five photorealistic multi-room apartments. An agent-agnostic action API supports end-to-end evaluation of diverse systems, including LLM agents, VLM agents, and symbolic planners, under one protocol, while optional modular probes for perception, memory, and planning enable component-level attribution of failures. Experiments on seven contemporary LLMs show that implicit intent resolution is a significant bottleneck for all model families, and that lightweight models fall to as low as 20.0% success on long-horizon tasks while consuming more tokens than frontier models, a pattern consistent with compensatory over-reasoning.
📝 Abstract
When an LLM-based embodied agent fails at a household task, the culprit could be misidentified objects, forgotten sub-goals, or poor action sequencing -- yet existing benchmarks report only a single success rate, making it impossible to tell which cognitive module is responsible. We present PRISM, a diagnostic benchmark that reframes this problem: rather than asking only \textit{did the agent succeed?}, PRISM asks \textit{which capability is most likely responsible for failure?} Built on five photorealistic multi-room apartments (4--8 rooms each), PRISM structures 300 human-verified tasks into three capability tiers -- \textit{Basic Ability}, \textit{Reasoning Ability}, and \textit{Long-horizon Ability} -- that isolate perception-to-action grounding, implicit intent resolution, and sustained multi-step coordination, respectively. PRISM exposes an agent-agnostic executable action API that allows arbitrary agents, including LLM agents, VLM agents, symbolic planners, RL policies, and hybrid systems, to be evaluated end-to-end under the same benchmark protocol. To support deeper diagnosis, optional probes for perception, memory, and planning can be adopted, replaced, or bypassed entirely, enabling controlled component-level analysis when desired. Experiments on seven contemporary LLMs establish a clear hierarchy: explicit spatial grounding is not the dominant failure source under oracle perception; implicit intent resolution is a significant bottleneck for all model families; and long-horizon coordination exposes a stark capability cliff, where lightweight models collapse to as low as 20.0\% success while consuming more tokens than their frontier counterparts, a signature of compensatory over-reasoning rather than genuine planning capability. Project page: \href{https://sj-li.com/PROJ/PRISM}{link}.
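To make the contrast between a single aggregate success rate and tiered reporting concrete, the following minimal Python sketch groups episode outcomes by capability tier and reports success rate and mean token usage for each. It is an illustration only, not PRISM's actual code: the `EpisodeResult` type, the `per_tier_report` function, and the tier label strings are all hypothetical names introduced here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    """One evaluated task episode (hypothetical record, not PRISM's schema)."""
    tier: str      # assumed labels: "basic", "reasoning", "long_horizon"
    success: bool  # whether the agent completed the task
    tokens: int    # tokens the agent consumed during the episode


def per_tier_report(results):
    """Return {tier: (success_rate, mean_tokens)} rather than one aggregate number."""
    grouped = defaultdict(list)
    for r in results:
        grouped[r.tier].append(r)
    return {
        tier: (
            sum(r.success for r in rs) / len(rs),  # fraction of successful episodes
            sum(r.tokens for r in rs) / len(rs),   # average token usage
        )
        for tier, rs in grouped.items()
    }
```

Reporting token usage next to success rate per tier is what would let a pattern such as low long-horizon success paired with high token consumption become visible, which a single pooled success rate would hide.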