Done, But Not Sure: Disentangling World Completion from Self-Termination in Embodied Agents

📅 2026-05-09

🤖 AI Summary

Current evaluation protocols for embodied agents conflate whether a task is actually completed with whether the agent correctly ends the episode, which obscures distinct failure modes. This work proposes VIGIL, an evaluation framework in which agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report that is checked deterministically against hidden world state. The design separates world-state completion (W) from benchmark success (B), where B additionally requires a correct terminal report, so that terminal commitment can be measured independently. VIGIL distinguishes four outcome categories: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 percentage points in B. Action feedback improves W broadly, but commitment failures persist in models that do not already ground terminal reports in the achieved state, which underlines the need for decoupled evaluation.

📝 Abstract

Standard embodied evaluations do not independently score whether an agent correctly commits to task completion at episode closure, a capacity we call terminal commitment. Behaviorally distinct failures (never completing the task, completing it but failing to stop, and reporting success without sufficient evidence) collapse into the same benchmark failure. We introduce VIGIL, an evaluation framework that makes terminal commitment independently measurable. Under VIGIL's default protocol, agents observe only egocentric RGB, receive no action-success signals, and must end each episode with a semantic report checked deterministically against hidden world state. This yields two separate scores: world-state completion (W) and benchmark success (B), where B additionally requires a correct terminal report. This decoupling makes four outcome categories distinguishable: missed execution, post-attainment drift, unsupported commitment, and verified success. Across 20 models on 1,000 frozen episodes, systems with comparable W differ by up to 19.7 pp in B: one model converts achieved states into correct reports, while another with near-identical execution drifts past the goal without closing. An action-feedback intervention further tests the separation: execution-oriented signals improve W broadly, yet commitment failures persist in models that do not already ground terminal reports in the achieved state. VIGIL provides a protocol that makes terminal commitment independently visible and scorable.
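The relationship between the two scores and the four outcome categories can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `Episode` fields (`world_complete`, `reported`, `report_correct`), the `classify` and `score` helpers, and the exact mapping from those flags to categories are assumptions made for illustration, based only on the descriptions in the abstract.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    MISSED_EXECUTION = "missed execution"
    POST_ATTAINMENT_DRIFT = "post-attainment drift"
    UNSUPPORTED_COMMITMENT = "unsupported commitment"
    VERIFIED_SUCCESS = "verified success"


@dataclass(frozen=True)
class Episode:
    # All three fields are hypothetical names, not taken from the paper.
    world_complete: bool  # hidden world state satisfies the goal
    reported: bool        # agent ended the episode with a terminal report
    report_correct: bool  # report passed the deterministic check


def classify(ep: Episode) -> Outcome:
    """Map one episode to an outcome category (assumed mapping)."""
    if ep.reported:
        if ep.world_complete and ep.report_correct:
            return Outcome.VERIFIED_SUCCESS
        # Assumption: any report not backed by the world state counts here.
        return Outcome.UNSUPPORTED_COMMITMENT
    if ep.world_complete:
        # Goal state reached, but the agent never closed the episode.
        return Outcome.POST_ATTAINMENT_DRIFT
    return Outcome.MISSED_EXECUTION


def score(episodes):
    """Return (W, B, per-category counts) for a non-empty list of episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    n = len(episodes)
    counts = Counter(classify(ep) for ep in episodes)
    w = sum(ep.world_complete for ep in episodes) / n  # world-state completion
    b = counts[Outcome.VERIFIED_SUCCESS] / n           # benchmark success
    return w, b, counts


# Toy data: one episode per category (not results from the paper).
episodes = [
    Episode(world_complete=True, reported=True, report_correct=True),
    Episode(world_complete=True, reported=False, report_correct=False),
    Episode(world_complete=False, reported=True, report_correct=False),
    Episode(world_complete=False, reported=False, report_correct=False),
]
w, b, counts = score(episodes)
print(f"W={w:.2f} B={b:.2f}")
```

Under these assumptions B can never exceed W, because every verified success also counts toward world-state completion. The gap between the two is the share of episodes in which the goal state was reached but not closed with a correct report.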

Problem

Research questions and friction points this paper is trying to address.

terminal commitment

embodied agents

task completion

evaluation framework

world-state completion

Innovation

Methods, ideas, or system contributions that make the work stand out.

terminal commitment

embodied agents

evaluation framework