EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

📅 2024-03-24

🏛️ Computer Vision and Pattern Recognition

📈 Citations: 22

✨ Influential: 1

🤖 AI Summary

This work addresses the challenge of modeling procedural activities across asynchronous, heterogeneous viewpoints, specifically egocentric (first-person) and exocentric (third-person) perspectives. To this end, we introduce the first large-scale paired dataset comprising 120 hours of real-world videos, synchronized eye-tracking data, and fine-grained multimodal annotations. We propose a novel “observe-learn-execute” acquisition paradigm to achieve temporal alignment and semantic action correspondence across viewpoints. Methodologically, we integrate multi-device synchronized recording, action segmentation, cross-view alignment annotation, and a multimodal representation learning framework. We further establish the first benchmark for asynchronous multi-view procedural activity understanding. Our open-sourced dataset and code improve cross-view action retrieval accuracy by 23.7%, providing a foundational resource for embodied AI systems to autonomously learn from human demonstrations.

📝 Abstract

Being able to map the activities of others into one's own point of view is a fundamental human skill even from a very early age. Taking a step toward understanding this human ability, we introduce EgoExoLearn, a large-scale dataset that emulates the human demonstration following process, in which individuals record egocentric videos as they execute tasks guided by exocentric-view demonstration videos. Focusing on the potential applications in daily assistance and professional support, EgoExoLearn contains egocentric and demonstration video data spanning 120 hours captured in daily life scenarios and specialized laboratories. Along with the videos we record high-quality gaze data and provide detailed multimodal annotations, formulating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints. To this end, we present benchmarks such as cross-view association, cross-view action planning, and cross-view referenced skill assessment, along with detailed analysis. We expect EgoExoLearn can serve as an important resource for bridging the actions across views, thus paving the way for creating AI agents capable of seamlessly learning by observing humans in the real world. The dataset and benchmark codes are available at https://github.com/OpenGVLab/EgoExoLearn.

Problem

Research questions and friction points this paper is trying to address.

Bridging asynchronous ego- and exo-centric views of procedural activities.

Modeling human ability to map actions across different viewpoints.

Creating AI agents that learn by observing human activities.

Innovation

Methods, ideas, or system contributions that make the work stand out.

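Large-scale dataset (120 hours) pairing egocentric videos with the exocentric demonstration videos that guided them.

Includes high-quality gaze data and detailed multimodal annotations.

Benchmarks for cross-view association, cross-view action planning, and cross-view referenced skill assessment.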
