🤖 AI Summary
This work addresses the online detection of mistakes in first-person videos, covering both procedural errors (e.g., step misordering) and execution errors (e.g., motion inaccuracies or tool misuse). We propose the first end-to-end online detection-feedback framework that jointly models both error types. The method combines temporal action recognition, sliding-window online inference, and multimodal feature alignment, and uses a large language model to generate interpretable natural-language feedback. Unlike prior approaches that target only one error category, it enables fine-grained, real-time, and explainable detection and intervention for both classes of error. On the HoloAssist benchmark, the framework ranks second in the mistake detection task, demonstrating robustness and practical utility for real-world industrial and educational settings.
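The detection-feedback loop described above can be sketched as follows. This is a minimal illustration only: the window length, the thresholding heuristic in `classify_window`, and the `generate_feedback` stub are assumptions standing in for the temporal action-recognition model and the LLM call, not the authors' implementation.

```python
from collections import deque

WINDOW = 8  # assumed sliding-window length, in frames


def classify_window(features):
    # Placeholder for the temporal action-recognition head:
    # flags a mistake when the mean per-frame activation exceeds
    # a threshold (illustrative heuristic, not the real model).
    return "mistake" if sum(features) / len(features) > 0.5 else "correct"


def generate_feedback(window_end):
    # Stand-in for the LLM call that would produce
    # natural-language corrective feedback.
    return f"Possible error near frame {window_end}: check step order or tool use."


def online_detect(frame_stream):
    """Slide a fixed-size window over incoming per-frame features,
    classify each full window, and emit feedback on detected mistakes."""
    buffer = deque(maxlen=WINDOW)
    feedback = []
    for i, feat in enumerate(frame_stream):
        buffer.append(feat)
        if len(buffer) == WINDOW and classify_window(buffer) == "mistake":
            feedback.append(generate_feedback(i))
    return feedback
```

The deque keeps only the most recent `WINDOW` frames, so detection runs incrementally as frames arrive rather than after the full video, which is the defining constraint of the online setting.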
📝 Abstract
In this report, we address the task of online mistake detection, which is vital in domains like industrial automation and education, where real-time video analysis allows human operators to correct errors as they occur. While previous work focuses on procedural errors involving action order, broader error types must be handled for real-world use. We introduce an online mistake detection framework that covers both procedural and execution errors (e.g., motor slips or tool misuse). Upon detecting an error, we use a large language model (LLM) to generate explanatory feedback. Experiments on the HoloAssist benchmark confirm the effectiveness of our approach, which places second on the mistake detection task.