🤖 AI Summary
Existing sign language generation systems suffer from inaccurate syntactic translation, inadequate modeling of non-manual markers (e.g., facial expressions, head pose, body tilt), and low video fidelity. This paper introduces the first end-to-end framework for generating American Sign Language (ASL) videos directly from English text. Our approach features a novel dual-path architecture that integrates a large language model (LLM) with a skeleton-driven, diffusion-based video generator: the LLM explicitly parses semantics and infers non-manual markers, while the video model synthesizes high-fidelity signer videos from joint-level pose sequences. To our knowledge, this is the first work to systematically model paralinguistic information and achieve decoupled yet coordinated generation of manual and non-manual components. A user study with 30 Deaf and Hard-of-Hearing (DHH) participants demonstrates a 37% improvement in comprehension accuracy, 89% grammatical correctness, and state-of-the-art performance in motion fluency and visual fidelity.
📝 Abstract
Sign languages are essential for the Deaf and Hard-of-Hearing (DHH) community. Sign language generation systems have the potential to support communication by translating from written languages, such as English, into signed videos. However, current systems often fail to meet user needs due to poor translation of grammatical structures, the absence of facial cues and body language, and insufficient visual and motion fidelity. We address these challenges by building on recent advances in LLMs and video generation models to translate English sentences into videos of natural-looking AI ASL signers. The text component of our model extracts information for the manual and non-manual components of ASL, which is used to synthesize skeletal pose sequences and corresponding video frames. Our findings from a user study with 30 DHH participants and thorough technical evaluations demonstrate significant progress and identify critical areas necessary to meet user needs.
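The dual-path data flow described above (LLM parse → manual glosses plus non-manual markers → pose-driven video synthesis) can be sketched at a very high level. This is a minimal illustration, not the paper's implementation: the `SignPlan` representation, the example gloss ordering, and both function stubs are hypothetical stand-ins for the LLM and diffusion components.

```python
from dataclasses import dataclass

@dataclass
class SignPlan:
    """Hypothetical intermediate representation: an English sentence
    parsed into manual and non-manual ASL components."""
    gloss_sequence: list[str]       # manual signs in ASL gloss order
    non_manual_markers: list[str]   # one marker per gloss, e.g. brow raise

def parse_with_llm(english: str) -> SignPlan:
    # Stand-in for the LLM path: reorders English into ASL gloss grammar
    # and infers non-manual markers. A real system would prompt an LLM;
    # here one example is hard-coded purely for illustration.
    if english == "Are you going to the store?":
        return SignPlan(
            gloss_sequence=["STORE", "YOU", "GO"],
            # yes/no questions in ASL carry a brow-raise marker
            non_manual_markers=["neutral", "neutral", "brow-raise"],
        )
    raise NotImplementedError("illustrative stub handles one sentence")

def synthesize_video(plan: SignPlan) -> list[str]:
    # Stand-in for the video path: each gloss would drive a skeletal
    # pose sequence that a diffusion model renders into frames. Here we
    # emit frame descriptors only, to show the decoupled data flow.
    return [f"{g}+{m}" for g, m in zip(plan.gloss_sequence,
                                       plan.non_manual_markers)]

frames = synthesize_video(parse_with_llm("Are you going to the store?"))
print(frames)  # ['STORE+neutral', 'YOU+neutral', 'GO+brow-raise']
```

The point of the separation is that manual content (glosses) and paralinguistic content (markers) travel as distinct streams yet stay aligned per sign, matching the paper's decoupled-but-coordinated generation.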