🤖 AI Summary
This work addresses the challenge of early mistake detection in procedural videos by proposing an adaptive observation framework that couples a mistake detector with a reinforcement learning exit policy. At each timestep, the detector estimates the correctness of critical steps from recently observed frames and anticipates future visual features, while an RL-based dynamic exit policy (a first for this task) jointly optimizes observation duration and detection accuracy. Experiments on multiple real-world procedural video datasets show that the proposed approach significantly outperforms existing methods, achieving higher detection accuracy while observing a markedly smaller fraction of each video.
📝 Abstract
We introduce the task of early mistake detection in video, where the goal is to determine whether a keystep in a procedural activity is performed correctly while observing as little of the streaming video as possible. To tackle this problem, we propose a method comprising a mistake detector and a reinforcement learning policy. At each timestep, the detector processes recently observed frames to estimate the keystep's correctness while anticipating future visual features, enabling reliable early mistake estimates. Meanwhile, the policy aggregates the detector outputs and visual observations over time and adaptively decides when to exit (i.e., stop processing incoming frames) while producing the final prediction. Using diverse real-world procedural video datasets, we demonstrate that our MistExit model achieves superior mistake detection accuracy while reducing the fraction of video observed compared to state-of-the-art models. Project: https://vision.cs.utexas.edu/projects/mist_exit.
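The streaming detect-and-exit loop described above can be sketched as follows. This is a minimal illustration under assumed simplifications, not the actual MistExit implementation: the real system uses a learned mistake detector over video features and an RL-trained exit policy, whereas here `threshold_policy` is a hypothetical stand-in that exits once the per-timestep mistake score is confident enough.

```python
def threshold_policy(history, threshold=0.9):
    """Toy stand-in for the RL exit policy: exit once the latest detector
    score is confidently far from 0.5 (i.e., clearly mistake or correct)."""
    p = history[-1]
    return max(p, 1.0 - p) >= threshold

def early_mistake_detection(frame_scores, policy=threshold_policy):
    """Stream per-timestep mistake probabilities, aggregate them, and stop
    as soon as the policy decides to exit.

    Returns (is_mistake, fraction_of_video_observed)."""
    history = []
    for t, p_mistake in enumerate(frame_scores, start=1):
        history.append(p_mistake)   # aggregate detector outputs over time
        if policy(history):         # adaptive exit decision
            break
    return history[-1] >= 0.5, t / len(frame_scores)
```

For example, with scores `[0.5, 0.6, 0.95, 0.97]` the loop exits at the third timestep, flagging a mistake after observing 75% of the video; with `[0.05, 0.03]` it exits immediately, predicting the keystep is correct. The key design point mirrored from the paper is that the exit decision is made online from the aggregated detector outputs, trading off observation length against prediction reliability.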