🤖 AI Summary
Sign language generation suffers from severe error accumulation across multi-stage pipelines (text → pose → video), resulting in significant distortion and limited progress. This paper proposes Stable Signer, an end-to-end hierarchical model that decouples the task into two stages: text understanding and pose-to-video generation. Its core contributions are: (1) the Sign Language Understanding Linker (SLUL), enabling semantically robust text encoding; (2) the SLP-MoE hand gesture rendering expert block, enhancing the diversity and precision of gesture rendering; and (3) the Semantic-Aware Gloss Masking (SAGM) loss, which supervises SLUL training and mitigates cross-modal misalignment. The entire model is trained end-to-end under a unified framework, substantially alleviating error propagation. Evaluated on mainstream benchmarks, Stable Signer achieves a 48.6% improvement in generation quality over state-of-the-art methods and supports high-fidelity, multi-style sign language video synthesis.
📝 Abstract
Sign Language Production (SLP) is the task of converting complex input text into realistic sign language video. Most previous works focus on the Text2Gloss, Gloss2Pose, and Pose2Vid stages, and some on the Prompt2Gloss and Text2Avatar stages. Progress in this field has been slow, however, because inaccuracies in text conversion, pose generation, and the rendering of poses into realistic human video accumulate gradually across these stages. In this paper, we therefore streamline the traditionally redundant structure, simplify and optimize the task objective, and design a new sign language generative model called Stable Signer. It reformulates SLP as a hierarchical end-to-end generation task comprising only text understanding (Prompt2Gloss, Text2Gloss) and Pose2Vid: text understanding is performed by our proposed Sign Language Understanding Linker (SLUL), and hand gestures are generated by the SLP-MoE hand gesture rendering expert block, yielding high-quality, multi-style sign language videos end to end. SLUL is trained with the newly developed Semantic-Aware Gloss Masking (SAGM) loss. Stable Signer improves generation quality by 48.6% over current state-of-the-art generation methods.
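To make the mixture-of-experts idea behind the SLP-MoE block concrete, the sketch below shows a generic soft-gated expert mixture: a router scores each expert per input, and the output is the gate-weighted sum of all expert outputs. All names, shapes, and the linear form of the experts are assumptions for illustration; the paper's actual SLP-MoE architecture is not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration only
D_IN, D_OUT, N_EXPERTS = 16, 8, 4

W_router = rng.normal(size=(D_IN, N_EXPERTS))          # gating network
W_experts = rng.normal(size=(N_EXPERTS, D_IN, D_OUT))  # one linear expert each

def softmax(x):
    # Numerically stable softmax over the last axis
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def moe_block(pose_feat):
    """Route a batch of pose features through all experts, mix by gate weights."""
    gates = softmax(pose_feat @ W_router)                        # (B, N_EXPERTS)
    expert_out = np.einsum('bd,edo->beo', pose_feat, W_experts)  # (B, E, D_OUT)
    return np.einsum('be,beo->bo', gates, expert_out)            # (B, D_OUT)

x = rng.normal(size=(2, D_IN))   # two hypothetical pose feature vectors
y = moe_block(x)
print(y.shape)  # (2, 8)
```

In a gesture-rendering setting, each expert would plausibly specialize in a subset of hand shapes or signing styles, with the router selecting the mixture per frame; that specialization is the usual motivation for MoE layers, not a detail confirmed by this abstract.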