Ham2Pose: Animating Sign Language Notation into Pose Sequences

📅 2022-11-24
🏛️ Computer Vision and Pattern Recognition
📈 Citations: 17
Influential: 1
📄 PDF
🤖 AI Summary
This paper tackles the generation of continuous pose sequences from HamNoSys sign language notation, a long-standing challenge in sign language animation. The authors propose the first learnable method for animating HamNoSys text into signed pose sequences; because HamNoSys is universal by design, the approach is invariant to the target sign language. Methodologically, they gradually generate pose predictions with Transformer encoders that build representations of the text and poses while accounting for spatial and temporal information, and they train with weak supervision, showing the model can learn from partial and inaccurate data. They also introduce DTW-MJE, a dynamic time warping based distance incorporating mean joint error that accounts for missing keypoints. The metric is validated on the large-scale AUTSL dataset, where it measures the distance between pose sequences more accurately than existing measurements, and is then used to assess the quality of the generated sequences. Code for the data pre-processing, the model, and the distance measurement is publicly released.
📝 Abstract
Translating spoken languages into Sign languages is necessary for open communication between the hearing and hearing-impaired communities. To achieve this goal, we propose the first method for animating a text written in HamNoSys, a lexical Sign language notation, into signed pose sequences. As HamNoSys is universal by design, our proposed method offers a generic solution invariant to the target Sign language. Our method gradually generates pose predictions using transformer encoders that create meaningful representations of the text and poses while considering their spatial and temporal information. We use weak supervision for the training process and show that our method succeeds in learning from partial and inaccurate data. Additionally, we offer a new distance measurement that considers missing keypoints, to measure the distance between pose sequences using DTW-MJE. We validate its correctness using AUTSL, a large-scale Sign language dataset, show that it measures the distance between pose sequences more accurately than existing measurements, and use it to assess the quality of our generated pose sequences. Code for the data pre-processing, the model, and the distance measurement is publicly released for future research.
Problem

Research questions and friction points this paper is trying to address.

Animating HamNoSys sign language notation into continuous signed pose sequences
Learning pose generation from partial and inaccurate training data
Measuring the distance between pose sequences in the presence of missing keypoints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer encoders jointly capture spatial and temporal information of text and poses
Weak supervision enables training on partial and inaccurate data
DTW-MJE measures the distance between pose sequences while handling missing keypoints
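The DTW-MJE idea described above can be sketched as follows: compute a mean joint error per frame pair that skips missing keypoints, then align the two sequences with standard dynamic time warping using that masked error as the local cost. This is a minimal illustration, not the released implementation; the `None`-for-missing-keypoint convention and the function names are assumptions for the sketch.

```python
import math

def masked_mje(frame_a, frame_b):
    """Mean joint error between two pose frames.

    Each frame is a list of keypoints; a keypoint is an (x, y) tuple or
    None when it is missing. Missing keypoints are excluded from the mean
    (assumed convention for this sketch).
    """
    dists = [
        math.dist(p, q)
        for p, q in zip(frame_a, frame_b)
        if p is not None and q is not None
    ]
    return sum(dists) / len(dists) if dists else 0.0

def dtw_mje(seq_a, seq_b):
    """Dynamic time warping over two pose sequences, with the masked
    mean joint error as the per-frame local cost."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # D[i][j] = cost of the best alignment of seq_a[:i] with seq_b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = masked_mje(seq_a[i - 1], seq_b[j - 1])
            # Standard DTW recurrence: match, or skip a frame on either side.
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Because the local cost ignores missing keypoints instead of penalizing them, incomplete ground-truth annotations do not inflate the distance, which is the property the paper's evaluation relies on.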