🤖 AI Summary
This work addresses the speech comprehension needs of deaf and hard-of-hearing individuals by proposing the first end-to-end text-to-Cued Speech (CS) generation method, which maps input text directly to temporally synchronized handshape and mouth movement sequences without relying on intermediate phonetic or acoustic representations. Methodologically, the pretrained audiovisual speech synthesis model AVTacotron2 is adapted to the CS generation task via cross-modal transfer learning, combining autoregressive sequence modeling with a carefully constructed, multi-source dataset featuring precise visual-linguistic alignment. The key contribution is an audio-agnostic visual speech generation paradigm that explicitly models hand–mouth articulatory coordination. Evaluated on two public CS datasets, the approach reaches approximately 77% phoneme-level decoding accuracy, significantly outperforming existing baselines and demonstrating both the feasibility and effectiveness of this paradigm for Cued Speech synthesis.
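The summary does not detail the architecture, so the following is only a rough illustrative sketch of the general idea: repurposing an audiovisual TTS-style autoregressive decoder so that its output heads predict synchronized hand and lip feature streams instead of audio frames. All module names, feature dimensions, and the simplified mean-pooled conditioning are assumptions for illustration, not the authors' actual AVTacotron2 adaptation.

```python
# Illustrative sketch only: dimensions, module layout, and the replacement of a
# spectrogram head with hand/lip heads are assumptions, not the paper's method.
import torch
import torch.nn as nn


class TextToCuedSpeechSketch(nn.Module):
    """Autoregressive text -> (hand, lip) feature sequences, Tacotron2-flavoured."""

    def __init__(self, vocab_size=64, emb_dim=256, enc_dim=256,
                 dec_dim=512, hand_dim=12, lip_dim=20):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Text encoder (stand-in for a pretrained audiovisual TTS encoder).
        self.encoder = nn.LSTM(emb_dim, enc_dim // 2, batch_first=True,
                               bidirectional=True)
        # Autoregressive decoder conditioned on the previous visual frame.
        self.decoder_cell = nn.LSTMCell(enc_dim + hand_dim + lip_dim, dec_dim)
        # New output heads: hand and lip features in place of an audio head.
        self.hand_head = nn.Linear(dec_dim, hand_dim)
        self.lip_head = nn.Linear(dec_dim, lip_dim)
        self.stop_head = nn.Linear(dec_dim, 1)
        self.hand_dim, self.lip_dim = hand_dim, lip_dim

    def forward(self, text_ids, max_frames=200):
        batch = text_ids.size(0)
        enc_out, _ = self.encoder(self.embedding(text_ids))
        # Crude global conditioning: mean-pool the encoder states
        # (a real model would use location-sensitive attention instead).
        context = enc_out.mean(dim=1)
        h = torch.zeros(batch, self.decoder_cell.hidden_size)
        c = torch.zeros_like(h)
        prev = torch.zeros(batch, self.hand_dim + self.lip_dim)
        hands, lips, stops = [], [], []
        for _ in range(max_frames):
            h, c = self.decoder_cell(torch.cat([context, prev], dim=-1), (h, c))
            hand = self.hand_head(h)   # e.g. handshape + hand-position features
            lip = self.lip_head(h)     # e.g. lip-opening / spreading features
            stops.append(self.stop_head(h))
            hands.append(hand)
            lips.append(lip)
            prev = torch.cat([hand, lip], dim=-1)
        return (torch.stack(hands, 1), torch.stack(lips, 1),
                torch.stack(stops, 1).squeeze(-1))


# Usage: dummy character IDs in, synchronized hand/lip frame sequences out.
model = TextToCuedSpeechSketch()
hand_seq, lip_seq, stop_logits = model(torch.randint(0, 64, (2, 30)), max_frames=50)
print(hand_seq.shape, lip_seq.shape)  # (2, 50, 12) and (2, 50, 20)
```

Because both streams are emitted by the same decoder state at every frame, hand and mouth trajectories stay temporally aligned by construction, which is the coordination property the paradigm relies on.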
📝 Abstract
This paper presents a novel approach to automatic Cued Speech generation (ACSG). Cued Speech (CS) is a visual communication system used by people with hearing impairment to better perceive spoken language. We explore transfer learning strategies by leveraging a pre-trained audiovisual autoregressive text-to-speech model (AVTacotron2), which is reprogrammed to infer CS hand and lip movements from text input. Experiments are conducted on two publicly available datasets, including one recorded specifically for this study. Performance is assessed with an automatic CS recognition system. With a phonetic-level decoding accuracy of approximately 77%, the results demonstrate the effectiveness of our approach.