🤖 AI Summary
Existing 3D activity datasets typically annotate only hand or body pose in isolation, or rely on marker-based motion capture whose physical markers introduce visual artifacts, limiting the study of natural hand–body coordination and hindering generalization to markerless videos. To address this, we introduce AssemblyHands-X, the first markerless, multi-view 3D hand–body benchmark for bimanual activities. Our annotation pipeline combines multi-view triangulation with SMPL-X mesh fitting to obtain reliable 3D registration of the hands and upper body, and the benchmark evaluates recent action recognition models based on graph convolution and spatio-temporal attention across different input representations (video, hand pose, body pose, and hand–body pose). Experiments demonstrate that pose-based action inference is more efficient and accurate than purely video-based approaches. Moreover, jointly modeling hand and body cues improves action recognition over using either alone, validating the critical role of coordinated representation in activity understanding. This work establishes a new foundation for studying embodied, collaborative human activities in unconstrained settings.
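As a rough illustration of the triangulation step in such a pipeline, the sketch below lifts per-view 2D keypoints to 3D with a standard direct linear transform (DLT). The function names, array shapes, and SVD-based solver are illustrative assumptions rather than the paper's actual implementation, and the subsequent SMPL-X mesh-fitting stage is omitted.

```python
import numpy as np

def triangulate_point(proj_mats, points_2d):
    """Linear (DLT) triangulation of one joint from two or more calibrated views.

    proj_mats: list of 3x4 camera projection matrices (one per view).
    points_2d: list of (x, y) pixel coordinates of the same joint in each view.
    Returns the 3D point minimizing the algebraic reprojection error.
    """
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the homogeneous point X:
        #   x * (P[2] @ X) - (P[0] @ X) = 0
        #   y * (P[2] @ X) - (P[1] @ X) = 0
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # de-homogenize

def triangulate_skeleton(proj_mats, keypoints_2d):
    """keypoints_2d: array of shape (num_views, num_joints, 2)."""
    num_joints = keypoints_2d.shape[1]
    return np.stack([
        triangulate_point(proj_mats, keypoints_2d[:, j]) for j in range(num_joints)
    ])
```

In a full pipeline, the triangulated hand and body joints would then serve as targets for fitting the SMPL-X mesh, which regularizes the per-joint estimates with a whole-body kinematic prior.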
📝 Abstract
Bimanual human activities inherently involve coordinated movements of both hands and body. However, the impact of this coordination on activity understanding has not been systematically evaluated due to the lack of suitable datasets. Such evaluation demands kinematic-level annotations (e.g., 3D pose) for the hands and body, yet existing 3D activity datasets typically annotate either hand or body pose. Another line of work employs marker-based motion capture to provide full-body pose, but the physical markers introduce visual artifacts, thereby limiting models' generalization to natural, markerless videos. To address these limitations, we present AssemblyHands-X, the first markerless 3D hand-body benchmark for bimanual activities, designed to study the effect of hand-body coordination for action recognition. We begin by constructing a pipeline for 3D pose annotation from synchronized multi-view videos. Our approach combines multi-view triangulation with SMPL-X mesh fitting, yielding reliable 3D registration of hands and upper body. We then validate different input representations (e.g., video, hand pose, body pose, or hand-body pose) across recent action recognition models based on graph convolution or spatio-temporal attention. Our extensive experiments show that pose-based action inference is more efficient and accurate than video baselines. Moreover, joint modeling of hand and body cues improves action recognition over using hands or upper body alone, highlighting the importance of modeling interdependent hand-body dynamics for a holistic understanding of bimanual activities.
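For intuition on what a pose-based recognition model consumes, below is a minimal spatio-temporal graph-convolution block operating on a hand-body joint sequence, in the spirit of ST-GCN-style models. The class name, layer sizes, joint count, and adjacency handling are illustrative assumptions and not the specific architectures benchmarked in the paper.

```python
import torch
import torch.nn as nn

class STGraphConvBlock(nn.Module):
    """One spatial-graph + temporal-convolution block over a pose sequence.

    Input:  x of shape (batch, channels, frames, joints)
            adj, a (joints, joints) normalized adjacency of the hand-body skeleton.
    """
    def __init__(self, in_channels, out_channels, temporal_kernel=9):
        super().__init__()
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        pad = (temporal_kernel - 1) // 2
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(temporal_kernel, 1),
                                  padding=(pad, 0))
        self.norm = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU()

    def forward(self, x, adj):
        x = self.spatial(x)                         # mix channels per joint
        x = torch.einsum('nctv,vw->nctw', x, adj)   # aggregate features over skeleton edges
        x = self.temporal(x)                        # convolve along the time axis
        return self.act(self.norm(x))

# Hypothetical usage: 3D coordinates, 64 frames, 63 joints (split between hands and body).
block = STGraphConvBlock(3, 64)
x = torch.randn(8, 3, 64, 63)
adj = torch.eye(63)  # placeholder; in practice, the normalized hand-body skeleton graph
out = block(x, adj)  # shape (8, 64, 64, 63)
```

Compared with a video backbone, such a model ingests only a few dozen joint coordinates per frame, which is one reason pose-based inference can be both faster and more accurate when reliable 3D hand-body annotations are available.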