EgoOops: A Dataset for Mistake Action Detection from Egocentric Videos Referring to Procedural Texts

📅 2024-10-07
🤖 AI Summary
Existing mistake detection methods primarily target visually salient errors in unconstrained activities, making them inadequate for procedural tasks (e.g., assembly, maintenance) where action correctness depends on adherence to step-by-step textual instructions; moreover, no benchmark video dataset exists with procedural-text annotations across multiple domains. Method: The authors introduce the first egocentric mistake action detection dataset tailored to text-following tasks, featuring video-text alignment, fine-grained mistake labels, and natural-language mistake descriptions across diverse procedural domains. They propose a text-guided mistake modeling approach, implemented as a multimodal joint learning framework that combines CLIP with a temporal action encoder to enable instruction-driven fine-grained alignment and discriminative mistake detection. Contribution/Results: Experiments show that incorporating procedural text improves mistake detection accuracy by 18.3%. The dataset is publicly released, establishing a foundation for evaluating text-guided embodied intelligence.

📝 Abstract
Mistake action detection is crucial for developing intelligent archives that detect workers' errors and provide feedback. Existing studies have focused on visually apparent mistakes in free-style activities, resulting in video-only approaches to mistake detection. However, in text-following activities, models cannot determine the correctness of some actions without referring to the texts. Additionally, current mistake datasets rarely use procedural texts for video recording except for cooking. To fill these gaps, this paper proposes the EgoOops dataset, where egocentric videos record erroneous activities when following procedural texts across diverse domains. It features three types of annotations: video-text alignment, mistake labels, and descriptions for mistakes. We also propose a mistake detection approach, combining video-text alignment and mistake label classification to leverage the texts. Our experimental results show that incorporating procedural texts is essential for mistake detection. Data is available through https://y-haneji.github.io/EgoOops-project-page/.
Problem

Research questions and friction points this paper is trying to address.

Intelligent archives should detect workers' mistakes and provide feedback, but existing video-only methods cannot do this for instruction-dependent tasks.
In text-following activities, the correctness of some actions cannot be judged without referring to the procedural text.
Existing mistake datasets rarely pair videos with procedural texts outside the cooking domain.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the EgoOops dataset: egocentric videos of erroneous activities performed while following procedural texts.
Provides three annotation types: video-text alignment, mistake labels, and natural-language mistake descriptions.
Proposes a detection approach combining video-text alignment with mistake label classification to leverage the texts across diverse activity domains.
Yuto Haneji
Kyoto University
Taichi Nishimura
PlayStation
Multimedia, Computer Vision, Natural Language Processing
Hirotaka Kameko
Assistant Professor, Kyoto University
Natural Language Processing, Game AI
Keisuke Shirai
AIST
Natural Language Processing, Robotics
Tomoya Yoshida
Kyoto University
Keiya Kajimura
Kyoto University
Koki Yamamoto
Kyoto University
Taiyu Cui
Kyoto University
Tomohiro Nishimoto
Kyoto University
Shinsuke Mori
Kyoto University