🤖 AI Summary
Existing imitation learning approaches suffer from fragmented viewpoint modeling—either focusing exclusively on egocentric or exocentric views—and lack cognitively grounded mechanisms, hindering faithful replication of the human “observe–act” closed-loop. To address this, we introduce EgoMe, the first large-scale, real-world paired egocentric-exocentric video dataset, comprising 7,902 synchronized ego-exo video pairs augmented with multimodal sensor data (eye-tracking, IMU, magnetometer). We systematically incorporate human inter-view transformation cognition into imitation learning, designing eight novel benchmark tasks. Through multi-camera synchronized acquisition, precise temporal alignment, semantic annotation, and statistics-driven quality validation, EgoMe achieves superior viewpoint consistency, behavioral diversity, modality richness, and real-world deployability compared to prior datasets. Empirical results demonstrate significant improvements in end-to-end robotic imitation performance and cross-view action generalization.
📝 Abstract
When interacting with the real world, humans often take the egocentric (first-person) view as a benchmark, naturally transferring behaviors observed from an exocentric (third-person) view to their own. This cognitive theory provides a foundation for researching how robots can more effectively imitate human behavior. However, current research either employs multiple cameras with different views focusing simultaneously on the same individual's behavior or deals with unpaired ego-exo view scenarios; no existing effort fully exploits this human cognitive behavior in the real world. To fill this gap, we introduce EgoMe, a novel large-scale egocentric dataset that follows the process of human imitation learning through the egocentric view in the real world. Our dataset includes 7,902 pairs of videos (15,804 videos in total) covering diverse daily behaviors in real-world scenarios. In each pair, one video captures an exocentric view of the imitator observing the demonstrator's actions, while the other captures an egocentric view of the imitator subsequently following those actions. Notably, our dataset also contains exo-ego eye gaze, angular velocity, acceleration, magnetic strength, and other multi-modal sensor data to assist in establishing correlations between the observing and following processes. In addition, we propose eight challenging benchmark tasks to fully leverage this data resource and promote research on robot imitation learning. Extensive statistical analysis demonstrates significant advantages over existing datasets. The proposed EgoMe dataset and benchmark will be released soon.
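To make the pairing structure concrete, below is a minimal sketch of what one EgoMe sample record might look like. This is an illustrative assumption only: the class name `EgoMePair`, its field names, the file paths, and the sensor tuple layouts are hypothetical and do not reflect the dataset's actual released format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical schema for one ego-exo pair; names and units are illustrative
# assumptions, not the dataset's released format.
@dataclass
class EgoMePair:
    pair_id: str    # unique identifier for the ego-exo pair
    exo_video: str  # exocentric view: imitator observing the demonstrator
    ego_video: str  # egocentric view: imitator following the actions
    # Timestamped sensor streams recorded alongside the videos:
    gaze: List[Tuple[float, float, float]] = field(default_factory=list)        # (t, x, y)
    angular_velocity: List[Tuple[float, ...]] = field(default_factory=list)     # (t, wx, wy, wz)
    acceleration: List[Tuple[float, ...]] = field(default_factory=list)         # (t, ax, ay, az)
    magnetic_field: List[Tuple[float, ...]] = field(default_factory=list)       # (t, mx, my, mz)
    description: str = ""  # semantic annotation of the demonstrated behavior

# Example: constructing one pair record (paths are placeholders).
sample = EgoMePair(
    pair_id="pair_0001",
    exo_video="videos/pair_0001_exo_observe.mp4",
    ego_video="videos/pair_0001_ego_follow.mp4",
    description="imitator watches the demonstrator fold a shirt, then folds it",
)
print(sample.pair_id, sample.exo_video, sample.ego_video)
```

The key design point conveyed by the paper is that observation (exo) and imitation (ego) are stored as an explicitly linked pair, with synchronized sensor streams available to correlate the two phases.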