HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos

📅 2024-11-28
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
This work addresses joint 3D hand and object tracking in multi-view egocentric video. It introduces a large-scale, high-fidelity, multimodal multi-view egocentric dataset (833 minutes, 3.7M+ images) covering kitchen, office, and living-room scenarios, with synchronized RGB/monochrome video, eye gaze, scene point clouds, and 6DoF hand and object poses. Ground truth comes from marker-based optical motion capture, hand annotations are provided in the UmeTrack and MANO formats, and objects are represented by 3D meshes with PBR materials. The paper also benchmarks the task of 3D lifting of unknown in-hand objects. Experiments show that multi-view methods significantly outperform single-view baselines on hand tracking, object pose estimation, and 3D lifting, underscoring the value of multi-view egocentric data for embodied perception.
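The summary lists several synchronized per-frame streams. A minimal sketch of how one such sample could be represented in Python; the class and field names are hypothetical illustrations, not the official HOT3D API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Hot3dFrame:
    """Hypothetical container mirroring the streams described above."""
    rgb: np.ndarray | None               # HxWx3 RGB image (None if unavailable)
    monochrome: list[np.ndarray]         # grayscale views from the other cameras
    gaze_dir: np.ndarray                 # 3-vector, eye-gaze direction
    camera_poses: list[np.ndarray]       # 4x4 world-from-camera transforms
    hand_poses: dict[str, np.ndarray]    # per-hand parameters (UmeTrack/MANO)
    object_poses: dict[str, np.ndarray]  # 4x4 6DoF pose per object ID
```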

📝 Abstract
We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D. The dataset offers over 833 minutes (3.7M+ images) of recordings that feature 19 subjects interacting with 33 diverse rigid objects. In addition to simple pick-up, observe, and put-down actions, the subjects perform actions typical for a kitchen, office, and living room environment. The recordings include multiple synchronized data streams containing egocentric multi-view RGB/monochrome images, eye gaze signal, scene point clouds, and 3D poses of cameras, hands, and objects. The dataset is recorded with two headsets from Meta: Project Aria, which is a research prototype of AI glasses, and Quest 3, a virtual-reality headset that has shipped millions of units. Ground-truth poses were obtained by a motion-capture system using small optical markers attached to hands and objects. Hand annotations are provided in the UmeTrack and MANO formats, and objects are represented by 3D meshes with PBR materials obtained by an in-house scanner. In our experiments, we demonstrate the effectiveness of multi-view egocentric data for three popular tasks: 3D hand tracking, model-based 6DoF object pose estimation, and 3D lifting of unknown in-hand objects. The evaluated multi-view methods, whose benchmarking is uniquely enabled by HOT3D, significantly outperform their single-view counterparts.
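The abstract names model-based 6DoF object pose estimation as one benchmark task. The page does not state the evaluation metric, but the ADD error (average distance of transformed model points) is a common choice in the 6DoF pose literature; a minimal NumPy sketch, offered as an illustration rather than the paper's protocol:

```python
import numpy as np

def add_error(R_est, t_est, R_gt, t_gt, model_points):
    """ADD error: mean distance between corresponding 3D model points
    transformed by the estimated and the ground-truth 6DoF pose.
    R_*: 3x3 rotations; t_*: 3-vectors; model_points: Nx3 array."""
    pts_est = model_points @ R_est.T + t_est  # points under estimated pose
    pts_gt = model_points @ R_gt.T + t_gt     # points under ground-truth pose
    return float(np.linalg.norm(pts_est - pts_gt, axis=1).mean())
```

A pose estimate is conventionally counted as correct when this error falls below a fraction of the object diameter (e.g., 10%).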
Problem

Research questions and friction points this paper is trying to address.

Tracking 3D hand and object poses jointly from egocentric multi-view video
Lack of a large-scale, high-fidelity dataset covering diverse real-world hand–object interaction scenarios
Quantifying how much multi-view egocentric data improves over single-view methods for 3D hand and object tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synchronized multi-view egocentric RGB/monochrome streams for 3D tracking (see the triangulation sketch after this list)
Marker-based optical motion capture for ground-truth hand and object poses
In-house scanner producing object meshes with PBR materials
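Given the calibrated camera poses shipped with the dataset, a 3D keypoint observed in two or more egocentric views can be recovered by linear (DLT) triangulation, which is one standard way multi-view methods exploit such data. A minimal sketch, independent of any HOT3D-specific API:

```python
import numpy as np

def triangulate_point(proj_mats, points_2d):
    """Linear (DLT) triangulation of one 3D point from >= 2 calibrated views.
    proj_mats: list of 3x4 projection matrices P = K [R | t]
    points_2d: list of (u, v) pixel observations, one per view."""
    A = []
    for P, (u, v) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on homogeneous X.
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]             # null vector of A (least-squares solution)
    return X[:3] / X[3]    # dehomogenize to a 3D point
```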
👥 Authors
Prithviraj Banerjee (Meta Reality Labs)
Sindi Shkodrani (Meta Reality Labs)
Pierre Moulon (Meta Reality Labs)
Shreyas Hampali (Research Scientist, Meta Reality Labs; computer vision, hand/object pose estimation)
Shangchen Han (Meta Reality Labs)
Fan Zhang (Meta Reality Labs)
Linguang Zhang (Facebook Reality Labs; Computer Vision, Robotics)
Jade Fountain (Meta Reality Labs)
Edward Miller (Meta Reality Labs)
Selen Basol (Meta Reality Labs)
Richard Newcombe (VP, Research Science at Reality Labs Research; Artificial Intelligence, Augmented Reality, Computer Vision, SLAM, Robotics)
Robert Wang (Meta Reality Labs)
J. Engel (Meta Reality Labs)
Tomás Hodan (Meta Reality Labs)