TI-PREGO: Chain of Thought and In-Context Learning for Online Mistake Detection in PRocedural EGOcentric Videos

📅 2024-11-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Real-time online detection of open-set errors (i.e., unknown or novel errors) in first-person procedural videos remains challenging because such videos are unstructured and no prior error annotations are available. Method: a dual-branch online architecture in which an action recognition branch performs frame-level, streaming action parsing, while an LLM-driven anticipation branch applies chain-of-thought reasoning over the recognized action sequence to forecast subsequent steps; errors are localized as inconsistencies between recognition and prediction. The method combines video action recognition, action-token aggregation, and in-context learning with large language models, supporting millisecond-scale inference on dynamic streaming input. Results: on two procedural video benchmarks, the approach significantly outperforms state-of-the-art methods in open-set error detection, achieves ultra-low latency (<50 ms), and generalizes across tasks. It demonstrates strong robustness and effectiveness without requiring predefined error samples, constituting the first online error detection framework for high-reliability domains such as manufacturing and healthcare.
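The summary's core mechanism, detecting a mistake whenever the recognized action disagrees with the action the anticipation branch predicted from the history, can be sketched as a simple online loop. This is a minimal illustration, not the paper's implementation: `recognize` and `predict_next` are hypothetical stand-ins for the recognition and LLM-based anticipation branches.

```python
def detect_mistakes_online(frames, recognize, predict_next):
    """Flag a mistake whenever the recognized action disagrees with the
    action the anticipation branch predicted from the history so far.

    `recognize` and `predict_next` are placeholders for the paper's
    recognition and LLM anticipation branches (illustrative only).
    """
    history = []   # action tokens observed so far
    mistakes = []  # (frame index, recognized action, expected action)
    for t, frame in enumerate(frames):
        action = recognize(frame)
        # Only check at action boundaries, i.e., when a new token starts.
        if not history or action != history[-1]:
            if history:
                expected = predict_next(history)
                if expected != action:
                    mistakes.append((t, action, expected))
            history.append(action)
    return mistakes
```

Because the check runs per incoming frame and only consults past tokens, the loop is causal and suits the streaming setting the summary describes.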

📝 Abstract
Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare, and skill-based training. The nature of such mistakes is inherently open-set, as unforeseen or novel errors may occur, necessitating robust detection systems that do not rely on prior examples of failure. Currently, however, no technique effectively detects open-set procedural mistakes online. We propose a dual branch architecture to address this problem in an online fashion: one branch continuously performs step recognition from the input egocentric video, while the other anticipates future steps based on the recognition module's output. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module. The recognition branch takes input frames, predicts the current action, and aggregates frame-level results into action tokens. The anticipation branch, specifically, leverages the solid pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones. Given the online nature of the task, we also thoroughly benchmark the difficulties associated with per-frame evaluations, particularly the need for accurate and timely predictions in dynamic online scenarios. Extensive experiments on two procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach. In a thorough evaluation including recognition and anticipation variants and state-of-the-art models, our method reveals its robustness and effectiveness in online applications.
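The abstract notes that the recognition branch "aggregates frame-level results into action tokens." One plausible form of such aggregation is run-length collapsing of per-frame labels with a minimum-duration filter to suppress recognition flicker; the sketch below is an assumption for illustration, and the paper's actual aggregation may differ.

```python
def aggregate_frame_predictions(frame_labels, min_len=3):
    """Collapse per-frame action labels into a sequence of action tokens,
    dropping runs shorter than `min_len` frames as recognition noise.
    A simplified sketch of one possible aggregation scheme.
    """
    tokens, run_label, run_len = [], None, 0

    def flush():
        # Keep a run only if it is long enough and not a duplicate token.
        if run_label is not None and run_len >= min_len:
            if not tokens or tokens[-1] != run_label:
                tokens.append(run_label)

    for label in frame_labels:
        if label == run_label:
            run_len += 1
        else:
            flush()
            run_label, run_len = label, 1
    flush()
    return tokens
```

For example, a one-frame misclassification in the middle of a long action is filtered out rather than emitted as a spurious token.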
Problem

Research questions and friction points this paper is trying to address.

How to detect procedural errors online, in real time, from egocentric video streams
How to handle open-set mistakes for which no prior failure examples exist
Whether a dual-branch recognition/anticipation design can expose errors as mismatches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch architecture pairing online step recognition with step anticipation
LLM in-context learning predicts the next action token from previously recognized ones
Mistakes flagged online as mismatches between recognized and anticipated actions
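The anticipation branch relies on the pattern-matching abilities of LLMs via in-context learning: complete action sequences from correct executions serve as demonstrations, and the partially observed sequence is left for the model to complete. The prompt builder below is a hedged sketch; the wording, token format, and example actions are illustrative assumptions, not the paper's actual prompt.

```python
def build_anticipation_prompt(context_sequences, observed_tokens):
    """Assemble an in-context learning prompt for next-step prediction.

    `context_sequences` are complete action-token sequences from correct
    executions (the in-context demonstrations); `observed_tokens` is the
    partial sequence recognized so far. Illustrative format only.
    """
    lines = ["Each line is a sequence of action tokens from a correct execution."]
    for seq in context_sequences:
        lines.append(", ".join(seq))
    lines.append("Complete the next token of this partial sequence:")
    # Trailing comma invites the model to continue the sequence.
    lines.append(", ".join(observed_tokens) + ",")
    return "\n".join(lines)
```

The returned string would then be sent to an LLM, and its completion compared against the next recognized action token to flag a mistake.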