HD-EPIC: A Highly-Detailed Egocentric Video Dataset

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited fine-grained understanding of existing vision-language models (VLMs) in real-world kitchen environments. It introduces the first multimodal egocentric video dataset to simultaneously achieve ecological validity (41 hours of RGB video, audio, and gaze collected across 9 real household kitchens) and lab-grade annotation precision: leveraging digital twins for 3D scene reconstruction, the authors manually annotate recipe steps, fine-grained actions (59K), ingredients with nutritional values, object movements (20K), object masks lifted to 3D (37K), and audio events (51K), all primed with gaze. Key contributions: (1) a high-density, multimodal, 3D-grounded annotation paradigm; (2) a challenging 26K-question VQA benchmark spanning 7 semantic dimensions. Experiments show that the long-context Gemini Pro achieves only 38.5% accuracy, exposing fundamental limitations of current VLMs in recipe reasoning, 3D perception, and cross-modal alignment.
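As a rough illustration of how a multiple-choice benchmark like this is typically scored, the sketch below computes overall and per-dimension accuracy. The file layout and field names (`dimension`, `id`, `answer`) are assumptions for illustration, not HD-EPIC's actual release format:

```python
import json
from collections import defaultdict

def score_vqa(pred_path: str, gt_path: str) -> dict:
    """Score a multiple-choice VQA benchmark (hypothetical file layout)."""
    # Predictions: {question_id: chosen option}; ground truth: list of records.
    with open(pred_path) as f:
        preds = json.load(f)
    with open(gt_path) as f:
        questions = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        dim = q["dimension"]          # one of the 7 semantic dimensions
        total[dim] += 1
        if preds.get(q["id"]) == q["answer"]:
            correct[dim] += 1

    # Per-dimension accuracy, plus a micro-averaged overall score.
    acc = {d: correct[d] / total[d] for d in total}
    acc["overall"] = sum(correct.values()) / sum(total.values())
    return acc
```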

📝 Abstract
We present a validation dataset of newly-collected kitchen-based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering: recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations. Importantly, all annotations are grounded in 3D through digital twinning of the scene, fixtures, object locations, and primed with gaze. Footage is collected from unscripted recordings in diverse home environments, making HD-EPIC the first dataset collected in-the-wild but with detailed annotations matching those in controlled lab environments. We show the potential of our highly-detailed annotations through a challenging VQA benchmark of 26K questions assessing the capability to recognise recipes, ingredients, nutrition, fine-grained actions, 3D perception, object motion, and gaze direction. The powerful long-context Gemini Pro achieves only 38.5% on this benchmark, showcasing its difficulty and highlighting shortcomings in current VLMs. We additionally assess action recognition, sound recognition, and long-term video-object segmentation on HD-EPIC. HD-EPIC is 41 hours of video in 9 kitchens with digital twins of 413 kitchen fixtures, capturing 69 recipes, 59K fine-grained actions, 51K audio events, 20K object movements and 37K object masks lifted to 3D. On average, we have 263 annotations per minute of our unscripted videos.
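To make the idea of "interconnected" annotations concrete, here is a minimal sketch of what such a 3D-grounded record set could look like: actions link to recipe steps and gaze, and object movements link to fixtures in the digital twin. The schema and field names are hypothetical, not the dataset's published format:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Fixture:
    """A kitchen fixture in the digital twin (e.g. a drawer or hob)."""
    fixture_id: str                            # e.g. "kitchen3/drawer_02"
    centroid_xyz: Tuple[float, float, float]   # 3D location in the scene

@dataclass
class ObjectMovement:
    """One object movement, tying video evidence to the 3D twin."""
    object_name: str
    start_s: float                 # seconds into the video
    end_s: float
    source: Fixture                # fixture the object was taken from
    destination: Fixture           # fixture the object was placed in/on

@dataclass
class FineGrainedAction:
    """A narrated action linked to its recipe step and the wearer's gaze."""
    narration: str                 # e.g. "pick up the whisk"
    recipe_step: str
    start_s: float
    end_s: float
    gaze_xy: Tuple[float, float]   # normalised gaze point at action start
```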
Problem

Research questions and friction points this paper is trying to address.

Validating VLMs' fine-grained understanding on in-the-wild egocentric video
Detailed, 3D-grounded annotation of unscripted kitchen activity
A challenging VQA benchmark to expose VLM shortcomings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Digital twinning for 3D-grounded annotations
Unscripted, in-the-wild video collection across 9 home kitchens
Highly-detailed, interconnected egocentric annotations (263 per minute)
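One common way to ground 2D annotations in a 3D digital twin is to cast a camera ray through the annotated pixel and intersect it with the reconstructed scene mesh. The sketch below (using `trimesh`) illustrates that general technique; it is an assumption for illustration, not necessarily the authors' pipeline:

```python
import numpy as np
import trimesh

def lift_pixel_to_3d(pixel_xy, K, cam_to_world, scene_mesh):
    """Lift a 2D pixel (e.g. a gaze point or mask centroid) into 3D
    by ray-casting against the reconstructed scene mesh.

    K: 3x3 camera intrinsics; cam_to_world: 4x4 camera pose.
    """
    # Back-project the pixel into a ray direction in camera coordinates.
    u, v = pixel_xy
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])

    # Transform the ray into world coordinates using the camera pose.
    origin = cam_to_world[:3, 3]
    direction = cam_to_world[:3, :3] @ ray_cam
    direction /= np.linalg.norm(direction)

    # Intersect the ray with the digital-twin mesh.
    locations, _, _ = scene_mesh.ray.intersects_location(
        ray_origins=[origin], ray_directions=[direction]
    )
    if len(locations) == 0:
        return None
    # Keep the hit closest to the camera (intersections are unordered).
    dists = np.linalg.norm(locations - origin, axis=1)
    return locations[np.argmin(dists)]
```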