DexAvatar: 3D Sign Language Reconstruction with Hand and Body Pose Priors

📅 2025-12-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Sign language generation requires biomechanically accurate 3D poses of the hands and body. However, existing datasets consist mostly of noisy, heavily occluded 2D keypoint annotations extracted from monocular videos, and conventional monocular 3D reconstruction methods degrade severely under self-occlusion and motion blur. To address this, we propose the first decoupled generative reconstruction framework guided by separate 3D hand and full-body pose priors. Our approach integrates a differentiable human body model (SMPL-X), a hand-specific prior network, and multi-scale spatio-temporal graph convolutional modules to achieve fine-grained, biomechanically plausible 3D reconstruction directly from in-the-wild monocular sign language videos. Evaluated on the SGNify benchmark, our method reduces pose estimation error for both body and hands by 35.11% relative to state-of-the-art approaches.

📝 Abstract

The trend in sign language generation is centered around data-driven generative methods that require vast amounts of precise 2D and 3D human pose data to achieve acceptable generation quality. However, most current sign language datasets are video-based, limited to automatically reconstructed 2D human poses (i.e., keypoints), and lack accurate 3D information. Furthermore, the existing state of the art in automatic 3D human pose estimation from sign language videos is prone to self-occlusion, noise, and motion blur, resulting in poor reconstruction quality. In response, we introduce DexAvatar, a novel framework to reconstruct bio-mechanically accurate, fine-grained hand articulations and body movements from in-the-wild monocular sign language videos, guided by learned 3D hand and body priors. DexAvatar achieves strong performance on the SGNify motion capture dataset, the only benchmark available for this task, reaching an improvement of 35.11% in the estimation of body and hand poses compared to the state of the art. The official website of this work is: https://github.com/kaustesseract/DexAvatar.
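
To make the "35.11% improvement" figure easier to read, the following is a minimal sketch of how a relative error reduction of this kind is commonly computed. It assumes a generic mean per-joint Euclidean error; the exact metric used on SGNify is not stated in this summary. The function names and the toy coordinates are hypothetical and are not taken from the paper.

```python
import math


def mean_joint_error(pred, target):
    """Mean Euclidean distance between corresponding 3D joints (assumed metric)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


def relative_improvement(baseline_error, new_error):
    """Percentage reduction in error relative to the baseline error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical toy data (NOT results from the paper): two ground-truth joints.
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
baseline_pred = [(0.0, 0.0, 2.0), (1.0, 2.0, 0.0)]  # each joint is off by 2 units
new_pred = [(0.0, 0.0, 1.0), (1.0, 1.0, 0.0)]       # each joint is off by 1 unit

baseline_err = mean_joint_error(baseline_pred, target)
new_err = mean_joint_error(new_pred, target)
improvement = relative_improvement(baseline_err, new_err)

print(f"baseline={baseline_err}, new={new_err}, improvement={improvement}%")
```

Under this reading, a 35.11% improvement would mean the new method's error is about 64.89% of the baseline's error on the same metric.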

Problem

Research questions and friction points this paper is trying to address.

Reconstructs 3D sign language from monocular videos

Addresses poor 3D pose estimation due to occlusion and noise

Improves accuracy of hand and body movement reconstruction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses learned 3D hand and body pose priors

Reconstructs fine-grained hand and body movements from monocular videos

Improves body and hand pose estimation by 35.11% over the state of the art

🔎 Similar Papers

SignMusketeers: An Efficient Multi-Stream Approach for Sign Language Translation at Scale

2024-06-11, arXiv.org, Citations: 3
