HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos

📅 2024-11-28
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
This work addresses joint 3D hand and object tracking in multi-view egocentric video. It introduces a large-scale, high-fidelity, multimodal multi-view egocentric dataset (833 minutes, 3.7M+ images) covering kitchen, office, and living-room scenarios, with synchronized RGB/monochrome video, eye gaze, scene point clouds, and 6DoF hand and object poses. Ground truth comes from marker-based optical motion capture, hand annotations are provided in the UmeTrack and MANO formats, and objects are represented by 3D meshes with PBR materials. The paper also benchmarks the task of 3D lifting of unknown in-hand objects. Experiments show that multi-view methods significantly outperform single-view baselines on hand tracking, object pose estimation, and 3D lifting, underscoring the value of multi-view egocentric data for embodied perception.
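The summary lists several synchronized per-frame streams. A minimal sketch of how one such sample could be represented in Python; the class and field names are hypothetical illustrations, not the official HOT3D API:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Hot3dFrame:
    """Hypothetical container mirroring the streams described above."""
    rgb: np.ndarray | None               # HxWx3 RGB image (None if unavailable)
    monochrome: list[np.ndarray]         # grayscale views from the other cameras
    gaze_dir: np.ndarray                 # 3-vector, eye-gaze direction
    camera_poses: list[np.ndarray]       # 4x4 world-from-camera transforms
    hand_poses: dict[str, np.ndarray]    # per-hand parameters (UmeTrack/MANO)
    object_poses: dict[str, np.ndarray]  # 4x4 6DoF pose per object ID
```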

📝 Abstract
We introduce HOT3D, a publicly available dataset for egocentric hand and object tracking in 3D. The dataset offers over 833 minutes (3.7M+ images) of recordings that feature 19 subjects interacting with 33 diverse rigid objects. In addition to simple pick-up, observe, and put-down actions, the subjects perform actions typical for a kitchen, office, and living room environment. The recordings include multiple synchronized data streams containing egocentric multi-view RGB/monochrome images, eye gaze signal, scene point clouds, and 3D poses of cameras, hands, and objects. The dataset is recorded with two headsets from Meta: Project Aria, which is a research prototype of AI glasses, and Quest 3, a virtual-reality headset that has shipped millions of units. Ground-truth poses were obtained by a motion-capture system using small optical markers attached to hands and objects. Hand annotations are provided in the UmeTrack and MANO formats, and objects are represented by 3D meshes with PBR materials obtained by an in-house scanner. In our experiments, we demonstrate the effectiveness of multi-view egocentric data for three popular tasks: 3D hand tracking, model-based 6DoF object pose estimation, and 3D lifting of unknown in-hand objects. The evaluated multi-view methods, whose benchmarking is uniquely enabled by HOT3D, significantly outperform their single-view counterparts.
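The abstract names model-based 6DoF object pose estimation as one benchmark task. The page does not state the evaluation metric, but the ADD error (average distance of transformed model points) is a common choice in the 6DoF pose literature; a minimal NumPy sketch, offered as an illustration rather than the paper's protocol:

```python
import numpy as np

def add_error(R_est, t_est, R_gt, t_gt, model_points):
    """ADD error: mean distance between corresponding 3D model points
    transformed by the estimated and the ground-truth 6DoF pose.
    R_*: 3x3 rotations; t_*: 3-vectors; model_points: Nx3 array."""
    pts_est = model_points @ R_est.T + t_est  # points under estimated pose
    pts_gt = model_points @ R_gt.T + t_gt     # points under ground-truth pose
    return float(np.linalg.norm(pts_est - pts_gt, axis=1).mean())
```

A pose estimate is conventionally counted as correct when this error falls below a fraction of the object diameter (e.g., 10%).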
Problem

Research questions and friction points this paper is trying to address.

Tracking 3D hand and object poses jointly from egocentric multi-view video
Lack of a large-scale, high-fidelity dataset covering diverse real-world hand–object interaction scenarios
Quantifying how much multi-view egocentric data improves over single-view methods for 3D hand and object tracking
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synchronized multi-view egocentric RGB/monochrome streams for 3D tracking (see the triangulation sketch after this list)
Marker-based optical motion capture for ground-truth hand and object poses
In-house scanner producing object meshes with PBR materials
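Given the calibrated camera poses shipped with the dataset, a 3D keypoint observed in two or more egocentric views can be recovered by linear (DLT) triangulation, which is one standard way multi-view methods exploit such data. A minimal sketch, independent of any HOT3D-specific API:

```python
import numpy as np

def triangulate_point(proj_mats, points_2d):
    """Linear (DLT) triangulation of one 3D point from >= 2 calibrated views.
    proj_mats: list of 3x4 projection matrices P = K [R | t]
    points_2d: list of (u, v) pixel observations, one per view."""
    A = []
    for P, (u, v) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on homogeneous X.
        A.append(u * P[2] - P[0])
        A.append(v * P[2] - P[1])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    X = Vt[-1]             # null vector of A (least-squares solution)
    return X[:3] / X[3]    # dehomogenize to a 3D point
```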
👥 Authors
Prithviraj Banerjee (Meta Reality Labs)
Sindi Shkodrani (Meta Reality Labs)
Pierre Moulon (Meta Reality Labs)
Shreyas Hampali (Research Scientist, Meta Reality Labs; computer vision, hand/object pose estimation)
Shangchen Han (Meta Reality Labs)
Fan Zhang (Meta Reality Labs)
Linguang Zhang (Facebook Reality Labs; Computer Vision, Robotics)
Jade Fountain (Meta Reality Labs)
Edward Miller (Meta Reality Labs)
Selen Basol (Meta Reality Labs)
Richard Newcombe (VP, Research Science at Reality Labs Research; Artificial Intelligence, Augmented Reality, Computer Vision, SLAM, Robotics)
Robert Wang (Meta Reality Labs)
J. Engel (Meta Reality Labs)
Tomás Hodan (Meta Reality Labs)