EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video

📅 2025-05-16

🤖 AI Summary

Problem: Imitation learning for dexterous manipulation is held back by a shortage of large-scale data, and existing egocentric datasets such as Ego4D lack native hand pose annotations and do not focus on object manipulation. Method: We introduce EgoDex, the largest and most diverse dataset of dexterous human manipulation to date, comprising 829 hours of egocentric video paired with 3D hand and finger tracking across 194 everyday tabletop tasks. The data is collected with Apple Vision Pro, whose multiple calibrated cameras and on-device SLAM track the pose of every joint of each hand at recording time. Results: Using EgoDex, we train and systematically evaluate imitation learning policies for hand trajectory prediction and introduce metrics and benchmarks for measuring progress on this task, with the aim of advancing robotics, computer vision, and foundation models.

📝 Abstract

Imitation learning for manipulation has a well-known data scarcity problem. Unlike natural language and 2D computer vision, there is no Internet-scale corpus of data for dexterous manipulation. One appealing option is egocentric human video, a passively scalable data source. However, existing large-scale datasets such as Ego4D do not have native hand pose annotations and do not focus on object manipulation. To this end, we use Apple Vision Pro to collect EgoDex: the largest and most diverse dataset of dexterous human manipulation to date. EgoDex has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording, where multiple calibrated cameras and on-device SLAM can be used to precisely track the pose of every joint of each hand. The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks ranging from tying shoelaces to folding laundry. Furthermore, we train and systematically evaluate imitation learning policies for hand trajectory prediction on the dataset, introducing metrics and benchmarks for measuring progress in this increasingly important area. By releasing this large-scale dataset, we hope to push the frontier of robotics, computer vision, and foundation models.
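The abstract mentions metrics for hand trajectory prediction but does not define them here. As an illustration only, the sketch below shows one common way such a metric could be computed: the mean Euclidean distance between predicted and ground-truth 3D keypoints over a trajectory, plus a best-of-K variant for models that output several candidate trajectories. The function names, the trajectory layout (a list of timesteps, each a list of (x, y, z) tuples), and the choice of metric are assumptions for this example, not the benchmark definition from the paper.

```python
import math


def mean_keypoint_error(predicted, ground_truth):
    """Average Euclidean distance between matching 3D keypoints.

    Both arguments are trajectories: a list of timesteps, each timestep a
    list of (x, y, z) keypoint positions in the same units and frame.
    This layout is a hypothetical choice made for the example.
    """
    if len(predicted) != len(ground_truth):
        raise ValueError("trajectories must have the same number of timesteps")
    total, count = 0.0, 0
    for pred_step, true_step in zip(predicted, ground_truth):
        if len(pred_step) != len(true_step):
            raise ValueError("timesteps must have the same number of keypoints")
        for p, q in zip(pred_step, true_step):
            total += math.dist(p, q)  # straight-line distance in 3D
            count += 1
    if count == 0:
        raise ValueError("trajectories must contain at least one keypoint")
    return total / count


def best_of_k_error(candidates, ground_truth):
    """Lowest mean keypoint error among K candidate trajectories."""
    return min(mean_keypoint_error(c, ground_truth) for c in candidates)


# Two timesteps, two keypoints each; one predicted keypoint is off by 5 units.
truth = [[(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]]
pred = [[(0, 0, 0), (1, 0, 0)], [(0, 3, 5), (1, 0, 1)]]
print(mean_keypoint_error(pred, truth))  # 1.25
```

Averaging over every keypoint and timestep gives a single number in the same units as the tracking data, which makes runs easy to compare. A best-of-K variant rewards models that capture several plausible futures instead of one averaged guess.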

Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale datasets for dexterous manipulation learning

Absence of hand pose annotations in existing egocentric video datasets

Need for diverse manipulation benchmarks and imitation learning policies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Apple Vision Pro for 3D hand tracking

Collects 829 hours of egocentric manipulation video

Trains and evaluates imitation learning policies for hand trajectory prediction
