🤖 AI Summary
This work addresses the fine-grained attribution of human manipulation errors in first-person video, formalizing the Mistake Attribution (MATT) task: jointly modeling semantic roles (which part of the instruction is violated), temporal boundaries (the Point of No Return, PNR, when the deviation becomes irreversible), and spatial locations (the error region within the PNR frame). To enable this, the authors introduce MisEngine, a data engine that automatically generates large-scale, multi-dimensional attribution annotations, yielding the first dedicated benchmarks for the task: EPIC-KITCHENS-M and Ego4D-M. They further propose MisFormer, a unified attention-based architecture integrating video-language understanding, temporal action localization, and hand-object interaction analysis. Evaluated on the new datasets and existing benchmarks, MisFormer consistently outperforms strong baselines, achieving state-of-the-art results across all three attribution dimensions: semantic role identification, PNR localization, and spatial error localization.
📝 Abstract
We introduce Mistake Attribution (MATT), a task for fine-grained understanding of human mistakes in egocentric video. Unlike prior mistake-understanding work, which lacks fine-grained outputs, MATT concretely attributes mistakes to the input instruction text or the attempt video. MATT determines what part of the instruction is violated (the semantic role), when the deviation becomes irreversible (the Point of No Return, PNR), and where the mistake appears in the PNR frame. We develop MisEngine, a data engine that automatically constructs attribution-rich mistake samples from existing datasets and inherits their annotations. Applied to large egocentric corpora, MisEngine yields EPIC-KITCHENS-M and Ego4D-M, two datasets that are up to two orders of magnitude larger than prior mistake datasets. We then present MisFormer, a unified attention-based model for mistake attribution across the semantic (what), temporal (when), and spatial (where) dimensions, trained with MisEngine supervision. Experiments on our new datasets and prior benchmarks show that MisFormer outperforms strong video-language, temporal localization, hand-object interaction, and mistake-detection baselines.