Procedural Mistake Detection via Action Effect Modeling

📅 2025-12-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing procedural task error detection methods primarily focus on action execution sequences while neglecting their physical effects—such as changes in object states or spatial relationships—leading to insufficient identification of outcome-oriented errors. To address this, we propose an Action Effect Modeling (AEM) framework that, for the first time, explicitly incorporates action effects into error detection, establishing a unified probabilistic model integrating both execution dynamics and resultant physical effects. Our approach leverages effect-frame selection, visual grounding, symbolic scene graph construction, and cross-modal alignment to learn effect-aware representations within a shared latent space. Furthermore, we design a prompt-based detector enabling fine-grained semantic alignment between predicted and expected effects. Evaluated on the EgoPER and CaptainCook4D benchmarks under the one-class classification (OCC) setting, our method achieves state-of-the-art performance, demonstrating that explicit modeling of action effects is critical for enhancing the reliability and robustness of procedural error detection.
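The effect-frame selection step described above can be sketched as a simple scoring problem: rank candidate frames by a combination of semantic relevance (how well the frame depicts the action's expected effect) and visual quality, then keep the argmax. The scoring functions and the weighting below are illustrative placeholders, not the paper's actual formulation.

```python
# Hypothetical sketch of AEM-style effect-frame selection.
# `relevance` and `quality` stand in for model-derived scores
# (e.g., text-image similarity and a blur/occlusion estimate);
# the weight `alpha` is an assumed hyperparameter.

def select_effect_frame(frames, relevance, quality, alpha=0.5):
    """Return the index of the frame maximizing a weighted sum of
    semantic relevance and visual quality."""
    assert len(frames) == len(relevance) == len(quality)
    scores = [alpha * r + (1 - alpha) * q for r, q in zip(relevance, quality)]
    return max(range(len(frames)), key=lambda i: scores[i])

# Example: three candidate frames from the tail of an action segment.
frames = ["f1", "f2", "f3"]
relevance = [0.2, 0.9, 0.7]   # similarity to the action's effect description
quality = [0.8, 0.6, 0.5]     # sharpness / occlusion-based quality estimate
best = select_effect_frame(frames, relevance, quality)
print(frames[best])  # "f2"
```

In practice the candidate pool would come from the final frames of each action segment, where the effect is most likely visible.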


📝 Abstract
Mistake detection in procedural tasks is essential for building intelligent systems that support learning and task execution. Existing approaches primarily analyze how an action is performed, while overlooking what it produces, i.e., the action effect. Yet many errors manifest not in the execution itself but in the resulting outcome, such as an unintended object state or incorrect spatial arrangement. To address this gap, we propose Action Effect Modeling (AEM), a unified framework that jointly captures action execution and its outcomes through a probabilistic formulation. AEM first identifies the outcome of an action by selecting the most informative effect frame based on semantic relevance and visual quality. It then extracts complementary cues from visual grounding and symbolic scene graphs, aligning them in a shared latent space to form robust effect-aware representations. To detect mistakes, we further design a prompt-based detector that incorporates task-specific prompts and aligns each action segment with its intended execution semantics. Our approach achieves state-of-the-art performance on the EgoPER and CaptainCook4D benchmarks under the challenging one-class classification (OCC) setting. These results demonstrate that modeling both execution and outcome yields more reliable mistake detection, and highlight the potential of effect-aware representations to benefit a broader range of downstream applications.
Problem

Research questions and friction points this paper is trying to address.

Existing methods analyze how actions are executed but overlook what they produce (the action effect)
Many procedural errors surface in outcomes, such as unintended object states or incorrect spatial arrangements, rather than in execution
Single-modality cues are insufficient; robust effect-aware detection requires combining visual and symbolic evidence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Action Effect Modeling (AEM) jointly captures action execution and its outcomes in a unified probabilistic formulation
Selects the most informative effect frame per action using semantic relevance and visual quality
Detects mistakes with a prompt-based detector that aligns each action segment with task-specific execution semantics
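The prompt-based detection idea above can be sketched as a similarity test in the shared latent space: embed an action segment and a task-specific prompt, and flag a mistake when their alignment falls below a threshold calibrated on normal (mistake-free) data, as in one-class classification. The embeddings and threshold below are illustrative placeholders, not the paper's learned representations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def is_mistake(segment_emb, prompt_emb, threshold=0.7):
    """Flag a segment as erroneous if it aligns poorly with the prompt
    describing the action's intended execution and effect. The threshold
    is an assumed value; in an OCC setting it would be calibrated on
    normal training data only."""
    return cosine(segment_emb, prompt_emb) < threshold

# Example: a well-aligned segment vs. a poorly aligned one.
prompt = [0.6, 0.8, 0.0]
print(is_mistake([0.59, 0.81, 0.02], prompt))  # False (normal)
print(is_mistake([0.0, 0.1, 0.99], prompt))    # True  (mistake)
```

This captures only the decision rule; the paper's contribution lies in learning the effect-aware representations that make such an alignment test discriminative.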