Toward Phonology-Guided Sign Language Motion Generation: A Diffusion Baseline and Conditioning Analysis

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Generating natural, accurate, and visually fluent 3D sign language motion from text remains a significant challenge. This work proposes the first sign language generation framework that explicitly incorporates phonological properties—such as handshape, location, and movement—into a diffusion model based on the MDM architecture and the SMPL-X representation, leveraging both CLIP and T5 text encoders. The study systematically investigates the impact of different conditioning input formats, including lexical tokens, lexical tokens augmented with phonological attributes, and symbolic versus natural language descriptions. It finds that translating symbolic annotations into natural language is crucial for effective CLIP encoding and motivates an independent-path encoding structure to enhance generation quality. The proposed method substantially outperforms the state-of-the-art SignAvatar on metrics such as gloss discriminability, demonstrating the critical role of input representation in phonologically conditioned sign language generation.

📝 Abstract
Generating natural, correct, and visually smooth 3D avatar sign language motion conditioned on text inputs remains highly challenging. In this work, we train a generative model of 3D body motion and explore the role of phonological attribute conditioning for sign language motion generation, using ASL-LEX 2.0 annotations such as handshape, hand location, and movement. We first establish a strong diffusion baseline using an MDM-style (Human Motion Diffusion Model) architecture with the SMPL-X representation, which outperforms SignAvatar, a state-of-the-art CVAE method, on gloss discriminability metrics. We then systematically study the role of text conditioning across different text encoders (CLIP vs. T5), conditioning modes (gloss-only vs. gloss+phonological attributes), and attribute notation formats (symbolic vs. natural language). Our analysis reveals that translating symbolic ASL-LEX notations into natural language is a necessary condition for effective CLIP-based attribute conditioning, while T5 is largely unaffected by this translation. Furthermore, our best-performing variant (CLIP with mapped attributes) outperforms SignAvatar across all metrics. These findings highlight input representation as a critical factor in text-encoder-based attribute conditioning and motivate structured conditioning approaches in which gloss and phonological attributes are encoded through independent pathways.
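The independent-pathway conditioning mentioned at the end of the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the dimensions (512-d text-encoder outputs, a 256-d conditioning vector) and the use of random matrices in place of learned linear layers are assumptions made purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(dim_in, dim_out):
    # Random projection standing in for a learned linear layer (assumption).
    W = rng.standard_normal((dim_in, dim_out)) / np.sqrt(dim_in)
    return lambda x: x @ W

# Hypothetical dimensions: text-encoder embeddings (e.g. CLIP) are 512-d;
# the diffusion denoiser expects a 256-d conditioning vector.
gloss_proj = linear(512, 128)  # pathway 1: gloss embedding
attr_proj = linear(512, 128)   # pathway 2: phonological-attribute embedding

def condition(gloss_emb, attr_emb):
    # Independent-path conditioning: each input is projected separately,
    # and the results are concatenated, so the gloss and attribute signals
    # are kept in distinct pathways until they reach the denoiser.
    return np.concatenate([gloss_proj(gloss_emb), attr_proj(attr_emb)])

c = condition(rng.standard_normal(512), rng.standard_normal(512))
print(c.shape)  # (256,)
```

The design point this illustrates is that neither pathway can overwrite the other before fusion, which is the motivation the abstract gives for encoding gloss and phonological attributes separately.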
Problem

Research questions and friction points this paper is trying to address.

sign language motion generation
phonological conditioning
3D avatar animation
text-to-motion
ASL-LEX
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion model
phonological conditioning
sign language motion generation
text-to-motion
ASL-LEX