🤖 AI Summary
Existing methods for 3D hand–object interaction tracking from egocentric video in unconstrained real-world conditions generalize poorly, since they rely on lab-collected datasets, and suffer from low annotation accuracy.
Method: We propose the first markerless, ego-exo multi-view hand tracking system designed for in-the-wild deployment: a lightweight mobile acquisition platform integrating an eight-camera exocentric backpack rig with Meta Quest 3’s stereo egocentric views; and an end-to-end ego-exo collaborative pose estimation framework enabling synchronized multi-view capture, automatic calibration, and high-fidelity 3D reconstruction.
Contribution/Results: We introduce a large-scale, high-quality synchronized multi-view dataset that substantially improves the trade-off between environmental diversity and annotation precision. Experiments demonstrate state-of-the-art 3D hand pose estimation accuracy in complex outdoor scenes and markedly improved cross-domain generalization.
📝 Abstract
Accurate 3D tracking of hands and their interactions with the world in unconstrained settings remains a significant challenge for egocentric computer vision. With few exceptions, existing datasets are predominantly captured in controlled lab setups, limiting environmental diversity and model generalization. To address this, we introduce a novel markerless multi-camera system designed to capture precise 3D hand and object motion while allowing nearly unconstrained mobility in genuinely in-the-wild conditions. We combine a lightweight, back-mounted capture rig carrying eight exocentric cameras with a user-worn Meta Quest 3 headset, which contributes two egocentric views. We design an ego-exo tracking pipeline to generate accurate 3D hand pose ground truth from this system, and rigorously evaluate its quality. By collecting an annotated dataset featuring synchronized multi-view images and precise 3D hand poses, we demonstrate that our approach substantially improves the trade-off between environmental realism and 3D annotation accuracy.
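The ego-exo pipeline described above fuses detections from calibrated exocentric and egocentric cameras into 3D hand poses. As a minimal, hypothetical illustration of the underlying geometry (not the paper's actual method, and `triangulate_midpoint` is an invented name), the 3D position of a keypoint seen from two calibrated views can be recovered as the midpoint of the shortest segment between the two back-projected rays:

```python
# Hypothetical sketch: midpoint triangulation of one keypoint from two
# calibrated cameras. Each ray is given by a camera center c and a
# direction d (from calibration + the 2D detection in that view).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between rays c1 + t1*d1 and c2 + t2*d2."""
    w = tuple(a - b for a, b in zip(c1, c2))      # offset between camera centers
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    det = b * b - a * c                           # zero only for parallel rays
    t1 = (c * dot(d1, w) - b * dot(d2, w)) / det  # closest point on ray 1
    t2 = (b * dot(d1, w) - a * dot(d2, w)) / det  # closest point on ray 2
    p1 = tuple(ci + t1 * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + t2 * di for ci, di in zip(c2, d2))
    return tuple((x + y) / 2.0 for x, y in zip(p1, p2))

# Two cameras, at the origin and at (1, 0, 0), both looking at the point (0, 0, 5):
# triangulate_midpoint((0, 0, 0), (0, 0, 1), (1, 0, 0), (-1, 0, 5)) -> (0.0, 0.0, 5.0)
```

A real system would triangulate from all ten views (eight exocentric plus two egocentric) with a least-squares solve and then fit a hand model, but the two-ray case captures why accurate calibration and synchronization are prerequisites for precise ground truth.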