🤖 AI Summary
Existing methods for 3D hand–object interaction tracking from egocentric video in unconstrained real-world conditions generalize poorly, since they rely on lab-collected datasets, and suffer from low annotation accuracy.
Method: We propose the first markerless, ego-exo multi-view hand tracking system designed for in-the-wild deployment: a lightweight mobile acquisition platform integrating an eight-camera exocentric backpack rig with Meta Quest 3’s stereo egocentric views; and an end-to-end ego-exo collaborative pose estimation framework enabling synchronized multi-view capture, automatic calibration, and high-fidelity 3D reconstruction.
Contribution/Results: We introduce a large-scale, high-quality synchronized multi-view dataset that substantially improves the trade-off between environmental diversity and annotation precision. Experiments demonstrate state-of-the-art 3D hand pose estimation accuracy in complex outdoor scenes and markedly improved cross-domain generalization.
📝 Abstract
Accurate 3D tracking of hands and their interactions with the world in unconstrained settings remains a significant challenge for egocentric computer vision. With few exceptions, existing datasets are predominantly captured in controlled lab setups, limiting environmental diversity and model generalization. To address this, we introduce a novel markerless multi-camera system designed to capture precise 3D hand and object motion while allowing nearly unconstrained mobility in genuinely in-the-wild conditions. We combine a lightweight, back-mounted capture rig carrying eight exocentric cameras with a user-worn Meta Quest 3 headset, which contributes two egocentric views. We design an ego-exo tracking pipeline to generate accurate 3D hand pose ground truth from this system, and rigorously evaluate its quality. By collecting an annotated dataset featuring synchronized multi-view images and precise 3D hand poses, we demonstrate that our approach substantially improves the trade-off between environmental realism and 3D annotation accuracy.
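The ego-exo pipeline described above fuses detections from calibrated exocentric and egocentric cameras into 3D hand poses. As a minimal, hypothetical illustration of the underlying geometry (not the paper's actual method, and `triangulate_midpoint` is an invented name), the 3D position of a keypoint seen from two calibrated views can be recovered as the midpoint of the shortest segment between the two back-projected rays:

```python
# Hypothetical sketch: midpoint triangulation of one keypoint from two
# calibrated cameras. Each ray is given by a camera center c and a
# direction d (from calibration + the 2D detection in that view).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + t1*d1 and c2 + t2*d2."""
    w = tuple(a - b for a, b in zip(c1, c2))      # offset between camera centers
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    det = b * b - a * c                           # zero only for parallel rays
    t1 = (c * dot(d1, w) - b * dot(d2, w)) / det  # closest point on ray 1
    t2 = (b * dot(d1, w) - a * dot(d2, w)) / det  # closest point on ray 2
    p1 = tuple(ci + t1 * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + t2 * di for ci, di in zip(c2, d2))
    return tuple((x + y) / 2.0 for x, y in zip(p1, p2))

# Two cameras, at the origin and at (1, 0, 0), both looking at the point (0, 0, 5):
# triangulate_midpoint((0, 0, 0), (0, 0, 1), (1, 0, 0), (-1, 0, 5)) -> (0.0, 0.0, 5.0)
```

A real system would triangulate from all ten views (eight exocentric plus two egocentric) with a least-squares solve and then fit a hand model, but the two-ray case captures why accurate calibration and synchronization are prerequisites for precise ground truth.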