How to Correctly Make Mistakes: A Framework for Constructing and Benchmarking Mistake Aware Egocentric Procedural Videos

📅 2026-04-16
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the scarcity of naturalistic and consistently annotated human errors and recovery behaviors in existing procedural video datasets, which hinders the development of reliable error-aware monitoring systems. To bridge this gap, the authors propose PIE-V, a novel framework that, for the first time, incorporates psychology-informed error and recovery planning to inject controllable, cognitively plausible mistakes into clean procedural videos and explicitly model their correction. The approach integrates a unified error taxonomy, a nine-dimensional human evaluation protocol, a cascaded consistency-preserving LLM rewriting module, and text-guided video generation to systematically produce error–correction pairs. Evaluated across 17 tasks and 50 Ego-Exo4D scenarios, the method successfully generates 102 errors and 27 corresponding recoveries, with human assessments confirming its significant superiority over baseline methods in step-wise logical coherence, state continuity, and text–video alignment.

📝 Abstract
Reliable procedural monitoring in video requires exposure to naturally occurring human errors and the recoveries that follow. In egocentric recordings, mistakes are often partially occluded by hands and revealed through subtle object state changes, while existing procedural datasets provide limited and inconsistent mistake and correction traces. We present PIE-V (Psychologically Inspired Error injection for Videos), a framework for constructing and benchmarking mistake-aware egocentric procedural videos by augmenting clean keystep procedures with controlled, human-plausible deviations. PIE-V combines a psychology-informed error planner conditioned on procedure phase and semantic step load, a correction planner that models recovery behavior, an LLM writer that performs cascade-consistent rewrites, and an LLM judge that validates procedural coherence and repairs failures. For video segment edits, PIE-V synthesizes replacement clips with text-guided video generation and stitches them into the episode to preserve visual plausibility. Applied to 17 tasks and 50 Ego-Exo4D scenarios, PIE-V injects 102 mistakes and generates 27 recovery corrections. For benchmarking, we introduce a unified taxonomy and a human rubric with nine metrics that cover step-level and procedure-level quality, including plausibility, procedure logic with annotator confidence, state change coherence, and grounding between text and video. Using this protocol, we audit several existing resources and compare PIE-V against a freeform LLM generation baseline under the same criteria. Together, the framework and rubric support post-completion verification for egocentric procedural mistake detection and correction.
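
The planning stage described in the abstract (an error planner conditioned on procedure phase and semantic step load, followed by a correction planner) can be illustrated with a small sketch. Everything below is an assumption made for illustration: the three-way phase split, the phase weights, the error labels, the `load` field, and the function names are hypothetical and are not taken from PIE-V or its taxonomy. The LLM writer, LLM judge, and video generation stages are omitted.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    index: int
    text: str
    load: float  # hypothetical "semantic step load" score in [0, 1]


@dataclass
class PlannedError:
    step_index: int
    error_type: str
    correction: Optional[str]  # None when the mistake is left uncorrected


# Illustrative values only; not the paper's weights or taxonomy.
PHASE_WEIGHT = {"early": 0.5, "middle": 1.0, "late": 1.5}
ERROR_TYPES = ["omission", "wrong_order", "wrong_object"]


def phase_of(index: int, n_steps: int) -> str:
    """Assign a step to one of three equal-width phases (an assumption)."""
    frac = index / max(n_steps - 1, 1)
    if frac < 1 / 3:
        return "early"
    if frac < 2 / 3:
        return "middle"
    return "late"


def plan_errors(steps: List[Step], n_errors: int, p_correct: float,
                seed: int = 0) -> List[PlannedError]:
    """Pick steps to corrupt, weighted by phase and load, without replacement.

    Each planned error receives a recovery instruction with probability
    p_correct, mimicking a separate correction planner.
    """
    rng = random.Random(seed)
    candidates = list(steps)
    weights = [PHASE_WEIGHT[phase_of(s.index, len(steps))] * (0.1 + s.load)
               for s in steps]
    plans: List[PlannedError] = []
    for _ in range(min(n_errors, len(candidates))):
        pos = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen = candidates.pop(pos)
        weights.pop(pos)
        error_type = rng.choice(ERROR_TYPES)
        correction = None
        if rng.random() < p_correct:
            correction = f"Notice the {error_type} at step {chosen.index} and redo it"
        plans.append(PlannedError(chosen.index, error_type, correction))
    return sorted(plans, key=lambda p: p.step_index)


if __name__ == "__main__":
    demo = [Step(i, f"step {i}", load=0.2 * (i % 5)) for i in range(8)]
    for plan in plan_errors(demo, n_errors=3, p_correct=0.3, seed=42):
        print(plan)
```

Sampling without replacement keeps at most one injected mistake per step, and fixing the seed makes a plan reproducible. In a fuller pipeline, each `PlannedError` would then be passed to the rewriting and validation stages the abstract describes.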

Problem

Research questions and friction points this paper is trying to address.

egocentric procedural videos
human errors
mistake detection
recovery behavior
procedural monitoring

Innovation

Methods, ideas, or system contributions that make the work stand out.

mistake-aware video generation
egocentric procedural understanding
psychologically inspired error injection
LLM-guided video editing
procedural recovery benchmarking