🤖 AI Summary
Existing sign language production (SLP) models suffer from limited signer diversity, poor visual realism, and inadequate modeling of non-manual features (e.g., facial expressions). To address these issues, we propose an end-to-end latent diffusion framework that explicitly disentangles manual (hand gestures) and non-manual (facial expressions, head pose) modalities, and introduces a novel multimodal feature aggregation module to enable their joint, coherent modeling. Furthermore, we incorporate a reference-image-guided generation mechanism to support controllable synthesis across diverse signer identities—including cross-ethnic and multi-appearance subjects. Evaluated on YouTube-SL-25, our method achieves significant improvements in visual quality (FID ↓18.3%, LPIPS ↓12.7%) and linguistic fidelity. Notably, it is the first SLP approach to generate high-diversity, photorealistic sign videos while fully preserving grammatical integrity—establishing a new paradigm for embodied sign language interaction.
📝 Abstract
The diversity of sign representation is essential for Sign Language Production (SLP) as it captures variations in appearance, facial expressions, and hand movements. However, existing SLP models are often unable to capture diversity while preserving visual quality and modelling non-manual attributes such as emotions. To address this problem, we propose a novel approach that leverages Latent Diffusion Model (LDM) to synthesise photorealistic digital avatars from a generated reference image. We propose a novel sign feature aggregation module that explicitly models the non-manual features ( extit{e.g.}, the face) and the manual features ( extit{e.g.}, the hands). We show that our proposed module ensures the preservation of linguistic content while seamlessly using reference images with different ethnic backgrounds to ensure diversity. Experiments on the YouTube-SL-25 sign language dataset show that our pipeline achieves superior visual quality compared to state-of-the-art methods, with significant improvements on perceptual metrics.