Ego-Exo 3D Hand Tracking in the Wild with a Mobile Multi-Camera Rig

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for 3D hand–object interaction tracking from egocentric video in unconstrained real-world conditions generalize poorly, since they rely on lab-collected datasets, and suffer from low annotation accuracy. Method: We propose the first markerless, ego-exo multi-view hand tracking system designed for in-the-wild deployment: a lightweight mobile acquisition platform integrating an eight-camera exocentric backpack rig with the Meta Quest 3's stereo egocentric views, and an end-to-end ego-exo collaborative pose estimation framework enabling synchronized multi-view capture, automatic calibration, and high-fidelity 3D reconstruction. Contribution/Results: We introduce a large-scale, high-quality synchronized multi-view dataset that substantially improves the trade-off between environmental diversity and annotation precision. Experiments demonstrate state-of-the-art 3D hand pose estimation accuracy in complex outdoor scenes and significantly enhanced cross-domain generalization.

📝 Abstract
Accurate 3D tracking of hands and their interactions with the world in unconstrained settings remains a significant challenge for egocentric computer vision. With few exceptions, existing datasets are predominantly captured in controlled lab setups, limiting environmental diversity and model generalization. To address this, we introduce a novel marker-less multi-camera system designed to capture precise 3D hands and objects, which allows for nearly unconstrained mobility in genuinely in-the-wild conditions. We combine a lightweight, back-mounted capture rig with eight exocentric cameras, and a user-worn Meta Quest 3 headset, which contributes two egocentric views. We design an ego-exo tracking pipeline to generate accurate 3D hand pose ground truth from this system, and rigorously evaluate its quality. By collecting an annotated dataset featuring synchronized multi-view images and precise 3D hand poses, we demonstrate the capability of our approach to significantly reduce the trade-off between environmental realism and 3D annotation accuracy.
Problem

Research questions and friction points this paper is trying to address.

Tracking 3D hands in unconstrained real-world settings
Overcoming limited environmental diversity in lab datasets
Reducing trade-off between realism and annotation accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mobile multi-camera rig captures 3D hands
Combines exocentric cameras with egocentric headset views
Generates accurate 3D hand poses in wild conditions
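The paper does not publish its pipeline code, but the core geometric idea behind combining exocentric and egocentric views is standard multi-view triangulation: each calibrated camera that observes a hand keypoint contributes two linear constraints on its 3D position. The sketch below illustrates that idea with direct linear transform (DLT) triangulation; the function name and the identity-intrinsics test cameras are illustrative choices, not the authors' implementation.

```python
import numpy as np

def triangulate_point(projections, points_2d):
    """Triangulate one 3D point from >= 2 calibrated views via DLT.

    projections: list of 3x4 camera projection matrices P = K [R | t]
    points_2d:   list of (u, v) pixel observations, one per view
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        # Each view contributes two linear constraints on the
        # homogeneous 3D point X: u * (P[2] @ X) = P[0] @ X, etc.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # Least-squares solution: right singular vector for the
    # smallest singular value of the stacked constraint matrix.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

With eight backpack cameras plus two headset views, a keypoint visible in even a subset of views yields an overdetermined system, which is what makes accurate ground truth possible without markers.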
Authors
Patrick Rim, Yale University
Kun He, Meta Reality Labs
Kevin Harris, Meta Reality Labs
Braden Copple, Meta Reality Labs
Shangchen Han, Meta Reality Labs
Sizhe An, Meta Reality Labs
Ivan Shugurov, Technische Universität München
Tomas Hodan, Meta Reality Labs
He Wen, Meta Reality Labs
Xu Xie, Meta Reality Labs