Text2Sign Diffusion: A Generative Approach for Gloss-Free Sign Language Production

πŸ“… 2025-09-13
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing sign language production (SLP) methods rely on gloss as an intermediate linguistic representation, suffering from language specificity and severe scarcity of gloss annotations, which hinders generalization. To address this, we propose Text2SignDiffβ€”the first gloss-free, end-to-end text-to-sign generation framework, built upon a non-autoregressive latent diffusion model. Our approach introduces a cross-modal alignment module that constructs a unified latent space jointly embedding textual semantics and sign pose dynamics, thereby eliminating reliance on gloss and mitigating error propagation. The architecture integrates a text encoder with a pose decoder to directly map spoken-language text to temporally coherent 3D sign pose sequences. Evaluated on PHOENIX14T and How2Sign, Text2SignDiff achieves state-of-the-art performance, significantly improving generation accuracy, temporal smoothness, and contextual consistency. This work advances robust, scalable digital communication support for Deaf and hard-of-hearing communities.

Technology Category

Application Category

πŸ“ Abstract
Sign language production (SLP) aims to translate spoken language sentences into a sequence of pose frames in a sign language, bridging the communication gap and promoting digital inclusion for deaf and hard-of-hearing communities. Existing methods typically rely on gloss, a symbolic representation of sign language words or phrases that serves as an intermediate step in SLP. This limits the flexibility and generalization of SLP, as gloss annotations are often unavailable and language-specific. Therefore, we present a novel diffusion-based generative approach - Text2Sign Diffusion (Text2SignDiff) for gloss-free SLP. Specifically, a gloss-free latent diffusion model is proposed to generate sign language sequences from noisy latent sign codes and spoken text jointly, reducing the potential error accumulation through a non-autoregressive iterative denoising process. We also design a cross-modal signing aligner that learns a shared latent space to bridge visual and textual content in sign and spoken languages. This alignment supports the conditioned diffusion-based process, enabling more accurate and contextually relevant sign language generation without gloss. Extensive experiments on the commonly used PHOENIX14T and How2Sign datasets demonstrate the effectiveness of our method, achieving the state-of-the-art performance.
Problem

Research questions and friction points this paper is trying to address.

Generating sign language sequences without gloss annotations
Bridging visual and textual modalities for accurate translation
Eliminating error accumulation in non-autoregressive sign production
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gloss-free latent diffusion model
Cross-modal signing aligner design
Non-autoregressive iterative denoising process
πŸ”Ž Similar Papers
No similar papers found.
L
Liqian Feng
School of Computer Science, The University of Sydney, Camperdown, NSW, Australia
Lintao Wang
Lintao Wang
The University of Sydney
character animationhuman motion understanding and generationlarge language modelai4science
K
Kun Hu
School of Science, Edith Cowan University, Joondalup, WA, Australia
D
Dehui Kong
Faculty of Information Technology, Beijing University of Technology, Beijing, China
Z
Zhiyong Wang
School of Computer Science, The University of Sydney, Camperdown, NSW, Australia