🤖 AI Summary
Existing hand–object interaction datasets are predominantly collected in controlled environments, which limits how well models trained on them generalize to complex real-world scenarios. To close this gap, this work proposes a lightweight, markerless multi-camera system, synchronized and calibrated with a user-worn VR headset, that captures high-fidelity 3D hand–object interaction data across diverse in-the-wild settings. By combining a backpack-mounted multi-camera array with a novel ego-exo joint tracking pipeline, the study achieves the first large-scale acquisition and annotation of high-precision 3D hand–object interactions in authentic outdoor environments. The resulting dataset, SHOW3D, is the first in-the-wild 3D hand–object interaction benchmark; it mitigates the longstanding trade-off between environmental realism and annotation accuracy and significantly improves model generalization across multiple downstream tasks.
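To make the annotation idea concrete, below is a minimal sketch of the multi-view geometry an ego-exo tracking pipeline rests on: triangulating a 3D hand keypoint from synchronized 2D detections in calibrated egocentric (headset) and exocentric (backpack rig) views via the standard direct linear transform (DLT). The projection-matrix setup, detection format, and function name are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def triangulate_dlt(projections, points_2d):
    """Triangulate one 3D point from >= 2 calibrated views via DLT.

    projections: list of 3x4 camera projection matrices P = K [R | t]
                 (ego headset cameras and exo rig cameras alike).
    points_2d:   list of (u, v) pixel detections of the same keypoint.
    Returns the least-squares 3D point in world coordinates.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        # Each view adds two linear constraints on the homogeneous point X:
        # u * (P[2] @ X) = P[0] @ X   and   v * (P[2] @ X) = P[1] @ X
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector of A with smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize

if __name__ == "__main__":
    # Toy check: two synthetic cameras observing a known 3D point.
    K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])               # "ego" view at origin
    P2 = K @ np.hstack([np.eye(3), np.array([[-0.3], [0], [0]])])   # "exo" view, 0.3 m baseline
    X_true = np.array([0.1, -0.05, 1.2, 1.0])
    xs = [P @ X_true for P in (P1, P2)]
    uv = [x[:2] / x[2] for x in xs]
    print(triangulate_dlt([P1, P2], uv))  # ~ [0.1, -0.05, 1.2]
```

In a real pipeline, each hand keypoint would be triangulated (or bundle-adjusted) from many more synchronized views, with outlier handling for occluded or mis-detected joints.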
📝 Abstract
Accurate 3D understanding of human hands and objects during manipulation remains a significant challenge for egocentric computer vision. Existing hand-object interaction datasets are predominantly captured in controlled studio settings, which limits both environmental diversity and the ability of models trained on such data to generalize to real-world scenarios. To address this challenge, we introduce a novel marker-less multi-camera system that allows nearly unconstrained mobility in genuinely in-the-wild conditions while still generating precise 3D annotations of hands and objects. The capture system consists of a lightweight, back-mounted multi-camera rig that is synchronized and calibrated with a user-worn VR headset. For 3D ground-truth annotation of hands and objects, we develop an ego-exo tracking pipeline and rigorously evaluate its quality. Finally, we present SHOW3D, the first large-scale dataset with 3D annotations of hands interacting with objects in diverse real-world environments, including outdoor settings. Our approach significantly reduces the fundamental trade-off between environmental realism and the accuracy of 3D annotations, which we validate with experiments on several downstream tasks.

Project page: show3d-dataset.github.io
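As a hedged illustration of what synchronizing a camera rig with a headset can involve, the sketch below recovers a constant clock offset between two devices by cross-correlating motion-magnitude signals (e.g., gyroscope norms) resampled to a common rate. The signal source, sampling rate, and function name are assumptions for the example; the paper's actual synchronization and calibration procedure may differ.

```python
import numpy as np

def estimate_time_offset(sig_a, sig_b, rate_hz):
    """Estimate how far sig_b lags sig_a (in seconds) via cross-correlation.

    sig_a, sig_b: 1D motion-magnitude signals (e.g., |gyro|) from the two
                  devices, resampled to the same rate `rate_hz`.
    Returns a positive offset when sig_b is delayed relative to sig_a.
    """
    # Normalize so amplitude differences between sensors do not bias the peak.
    a = (sig_a - sig_a.mean()) / (sig_a.std() + 1e-9)
    b = (sig_b - sig_b.mean()) / (sig_b.std() + 1e-9)
    corr = np.correlate(b, a, mode="full")
    lag = int(np.argmax(corr)) - (len(a) - 1)  # lag in samples
    return lag / rate_hz

if __name__ == "__main__":
    # Toy check: a 0.25 s delayed copy of a motion-like signal at 100 Hz.
    rate = 100.0
    t = np.arange(0.0, 10.0, 1.0 / rate)
    motion = lambda s: np.abs(np.sin(2 * s) + 0.3 * np.sin(7 * s))
    rng = np.random.default_rng(0)
    a = motion(t) + 0.01 * rng.standard_normal(t.size)
    b = motion(t - 0.25) + 0.01 * rng.standard_normal(t.size)
    print(estimate_time_offset(a, b, rate))  # ~ 0.25
```

Cross-correlating inertial magnitudes is a common lightweight synchronization heuristic; a production pipeline would additionally need sub-sample refinement and clock-drift estimation.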