🤖 AI Summary
This work addresses the challenge of limited accuracy in 3D hand pose estimation from monocular RGB images due to the absence of depth information. The authors propose a two-stage framework: first, a gesture-aware pre-training phase leverages discrete gesture labels to construct a semantically rich embedding space; second, this embedding guides a Token Transformer to regress MANO parameters on a per-joint basis. This is the first approach to incorporate discrete gesture semantics as an inductive bias into 3D hand pose estimation, enabling seamless transfer across architectures without modifying model structure. Combined with a hierarchical objective function encompassing MANO parameters, joint positions, and structural constraints, the method significantly outperforms the EANet baseline on InterHand2.6M, achieving consistent improvements in single-hand pose accuracy and demonstrating strong generalization of the pre-training benefits.
📝 Abstract
Estimating 3D hand pose from monocular RGB images is fundamental for applications in AR/VR, human-computer interaction, and sign language understanding. In this work we focus on a scenario where a discrete set of gesture labels is available and show that gesture semantics can serve as a powerful inductive bias for 3D pose estimation. We present a two-stage framework: a gesture-aware pretraining stage that learns an informative embedding space from coarse and fine gesture labels in InterHand2.6M, followed by a per-joint Token Transformer that uses the learned gesture embeddings as intermediate representations to regress the final MANO hand parameters. Training is driven by a layered objective over MANO parameters, joint positions, and structural constraints. Experiments on InterHand2.6M demonstrate that gesture-aware pretraining consistently improves single-hand accuracy over the state-of-the-art EANet baseline, and that the benefit transfers across architectures without any modification.
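The layered objective described above can be sketched as a weighted sum of three terms: a loss on the MANO parameters, a loss on the 3D joint positions, and a structural constraint. The specific losses (L1), the weights, and the use of bone lengths as the structural term are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np

def layered_loss(pred_mano, gt_mano, pred_joints, gt_joints,
                 bone_pairs, w_mano=1.0, w_joint=1.0, w_struct=0.1):
    """Sketch of a layered objective over MANO parameters, joint
    positions, and structural constraints. L1 losses, the weights,
    and the bone-length structural term are assumptions."""
    # Parameter-level term: L1 on predicted vs. ground-truth MANO parameters
    l_mano = np.abs(pred_mano - gt_mano).mean()
    # Joint-level term: L1 on regressed 3D joint positions
    l_joint = np.abs(pred_joints - gt_joints).mean()
    # Structural term (assumed): match bone lengths between
    # parent/child joint pairs of the hand skeleton
    parents, children = map(list, zip(*bone_pairs))
    pred_bones = np.linalg.norm(pred_joints[children] - pred_joints[parents], axis=-1)
    gt_bones = np.linalg.norm(gt_joints[children] - gt_joints[parents], axis=-1)
    l_struct = np.abs(pred_bones - gt_bones).mean()
    return w_mano * l_mano + w_joint * l_joint + w_struct * l_struct
```

A perfect prediction drives all three terms to zero, while weighting lets the structural constraint act as a soft regularizer rather than dominating the parameter and joint terms.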