🤖 AI Summary
This work addresses a limitation of existing sign language dialogue systems: they mediate interaction through spoken-language text, and because large-scale continuous signing data are scarce, they have restricted vocabulary coverage and generalize poorly to open-domain conversation. The authors propose constructing continuous sign language dialogues by recomposing large-scale labeled isolated-sign clips into sign-language-ordered utterances, which enables training a direct sign-to-sign 3D response model that needs no spoken-language text or externally provided glosses at inference time. The approach combines a retrieval-guided spoken-to-gloss translator, which bridges the structural mismatch between spoken and signed languages, with BRAID, a diffusion Transformer that aligns durations and inpaints co-articulatory boundaries to smooth transitions between clips. The work introduces two datasets, SignaVox-W and SignaVox-U, and reports improved isolated-to-continuous motion quality and stronger response-level semantic alignment, pointing toward scalable, signer-centered visual-spatial interaction.
📝 Abstract
Sign language is the primary language for many Deaf and Hard-of-Hearing (DHH) signers, yet most conversational AI systems still mediate interaction through spoken or written language. This spoken-language-centered interface can limit access for signers for whom spoken or written language is not the most accessible medium, motivating direct sign-to-sign conversational modeling. However, sentence-level sign video data are expensive to collect and annotate, leaving existing sign translation and production models with limited vocabulary coverage and weak open-domain generalization. We address this bottleneck by constructing continuous sign conversations from isolated signs: large-scale labeled isolated clips are collected as lexically grounded motion primitives and recomposed into sign-language-ordered utterances derived from existing dialogue corpora. We introduce SignaVox-W, which provides, to our knowledge, the largest labeled isolated-sign vocabulary to date, and SignaVox-U, a continuous 3D sign conversation dataset built from SignaVox-W. To bridge the structural mismatch between spoken and signed languages, we use a retrieval-guided spoken-to-gloss translator; to join independently collected isolated clips, we propose BRAID, a diffusion Transformer that performs duration alignment and co-articulatory boundary inpainting. With the resulting data, we train SignaVox, a direct sign-to-sign conversational model that generates 3D body, hand, and facial motion responses from prior signing context without spoken-language text or externally provided glosses at inference time. Quantitative and qualitative evaluations show improved isolated-to-continuous motion quality, stronger response-level semantic alignment, and scalable signer-centered interaction that better supports visual-spatial articulation.