🤖 AI Summary
This work addresses the challenge of limited accuracy in 3D hand pose estimation from monocular RGB images due to the absence of depth information. The authors propose a two-stage framework: first, a gesture-aware pre-training phase leverages discrete gesture labels to construct a semantically rich embedding space; second, this embedding guides a Token Transformer to regress MANO parameters on a per-joint basis. This is the first approach to incorporate discrete gesture semantics as an inductive bias into 3D hand pose estimation, enabling seamless transfer across architectures without modifying model structure. Combined with a hierarchical objective function encompassing MANO parameters, joint positions, and structural constraints, the method significantly outperforms the EANet baseline on InterHand2.6M, achieving consistent improvements in single-hand pose accuracy and demonstrating strong generalization of the pre-training benefits.
📝 Abstract
Estimating 3D hand pose from monocular RGB images is fundamental for applications in AR/VR, human-computer interaction, and sign language understanding. In this work we focus on a scenario where a discrete set of gesture labels is available and show that gesture semantics can serve as a powerful inductive bias for 3D pose estimation. We present a two-stage framework: a gesture-aware pretraining stage that learns an informative embedding space from coarse and fine gesture labels in InterHand2.6M, followed by a per-joint Token Transformer that uses the learned gesture embeddings as intermediate representations to regress the final MANO hand parameters. Training is driven by a layered objective over MANO parameters, joint positions, and structural constraints. Experiments on InterHand2.6M demonstrate that gesture-aware pretraining consistently improves single-hand accuracy over the state-of-the-art EANet baseline, and that the benefit transfers across architectures without any modification.
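The layered objective described above can be sketched as a weighted sum of three terms: a loss on the MANO parameters, a loss on the 3D joint positions, and a structural constraint. The specific losses (L1), the weights, and the use of bone lengths as the structural term are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np

def layered_loss(pred_mano, gt_mano, pred_joints, gt_joints,
                 bone_pairs, w_mano=1.0, w_joint=1.0, w_struct=0.1):
    """Sketch of a layered objective over MANO parameters, joint
    positions, and structural constraints. L1 losses, the weights,
    and the bone-length structural term are assumptions."""
    # Parameter-level term: L1 on predicted vs. ground-truth MANO parameters
    l_mano = np.abs(pred_mano - gt_mano).mean()
    # Joint-level term: L1 on regressed 3D joint positions
    l_joint = np.abs(pred_joints - gt_joints).mean()
    # Structural term (assumed): match bone lengths between
    # parent/child joint pairs of the hand skeleton
    parents, children = map(list, zip(*bone_pairs))
    pred_bones = np.linalg.norm(pred_joints[children] - pred_joints[parents], axis=-1)
    gt_bones = np.linalg.norm(gt_joints[children] - gt_joints[parents], axis=-1)
    l_struct = np.abs(pred_bones - gt_bones).mean()
    return w_mano * l_mano + w_joint * l_joint + w_struct * l_struct
```

A perfect prediction drives all three terms to zero, while weighting lets the structural constraint act as a soft regularizer rather than dominating the parameter and joint terms.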