🤖 AI Summary
This work addresses the speech comprehension needs of deaf and hard-of-hearing individuals by proposing the first end-to-end text-to-Cued Speech (CS) generation method, which maps input text directly to temporally synchronized handshape and mouth movement sequences without relying on intermediate phonetic or acoustic representations. Methodologically, the pretrained audiovisual speech synthesis model AVTacotron2 is adapted to the CS generation task via cross-modal transfer learning, combining autoregressive sequence modeling with a carefully constructed, multi-source dataset featuring precise visual-linguistic alignment. The key contribution is an audio-agnostic visual speech generation paradigm that explicitly models hand–mouth articulatory coordination. Evaluated on two public CS datasets, the approach reaches approximately 77% phoneme-level decoding accuracy, significantly outperforming existing baselines and demonstrating both the feasibility and effectiveness of this paradigm for Cued Speech synthesis.
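The summary does not detail the architecture, so the following is only a rough illustrative sketch of the general idea: repurposing an audiovisual TTS-style autoregressive decoder so that its output heads predict synchronized hand and lip feature streams instead of audio frames. All module names, feature dimensions, and the simplified mean-pooled conditioning are assumptions for illustration, not the authors' actual AVTacotron2 adaptation.

```python
# Illustrative sketch only: dimensions, module layout, and the replacement of a
# spectrogram head with hand/lip heads are assumptions, not the paper's method.
import torch
import torch.nn as nn


class TextToCuedSpeechSketch(nn.Module):
    """Autoregressive text -> (hand, lip) feature sequences, Tacotron2-flavoured."""

    def __init__(self, vocab_size=64, emb_dim=256, enc_dim=256,
                 dec_dim=512, hand_dim=12, lip_dim=20):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Text encoder (stand-in for a pretrained audiovisual TTS encoder).
        self.encoder = nn.LSTM(emb_dim, enc_dim // 2, batch_first=True,
                               bidirectional=True)
        # Autoregressive decoder conditioned on the previous visual frame.
        self.decoder_cell = nn.LSTMCell(enc_dim + hand_dim + lip_dim, dec_dim)
        # New output heads: hand and lip features in place of an audio head.
        self.hand_head = nn.Linear(dec_dim, hand_dim)
        self.lip_head = nn.Linear(dec_dim, lip_dim)
        self.stop_head = nn.Linear(dec_dim, 1)
        self.hand_dim, self.lip_dim = hand_dim, lip_dim

    def forward(self, text_ids, max_frames=200):
        batch = text_ids.size(0)
        enc_out, _ = self.encoder(self.embedding(text_ids))
        # Crude global conditioning: mean-pool the encoder states
        # (a real model would use location-sensitive attention instead).
        context = enc_out.mean(dim=1)
        h = torch.zeros(batch, self.decoder_cell.hidden_size)
        c = torch.zeros_like(h)
        prev = torch.zeros(batch, self.hand_dim + self.lip_dim)
        hands, lips, stops = [], [], []
        for _ in range(max_frames):
            h, c = self.decoder_cell(torch.cat([context, prev], dim=-1), (h, c))
            hand = self.hand_head(h)   # e.g. handshape + hand-position features
            lip = self.lip_head(h)     # e.g. lip-opening / spreading features
            stops.append(self.stop_head(h))
            hands.append(hand)
            lips.append(lip)
            prev = torch.cat([hand, lip], dim=-1)
        return (torch.stack(hands, 1), torch.stack(lips, 1),
                torch.stack(stops, 1).squeeze(-1))


# Usage: dummy character IDs in, synchronized hand/lip frame sequences out.
model = TextToCuedSpeechSketch()
hand_seq, lip_seq, stop_logits = model(torch.randint(0, 64, (2, 30)), max_frames=50)
print(hand_seq.shape, lip_seq.shape)  # (2, 50, 12) and (2, 50, 20)
```

Because both streams are emitted by the same decoder state at every frame, hand and mouth trajectories stay temporally aligned by construction, which is the coordination property the paradigm relies on.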
📝 Abstract
This paper presents a novel approach to automatic Cued Speech generation (ACSG). Cued Speech (CS) is a visual communication system used by people with hearing impairment to better perceive spoken language. We explore transfer learning strategies by leveraging a pre-trained audiovisual autoregressive text-to-speech model (AVTacotron2), which is reprogrammed to infer CS hand and lip movements from text input. Experiments are conducted on two publicly available datasets, including one recorded specifically for this study. Performance is assessed with an automatic CS recognition system. With a phonetic-level decoding accuracy of approximately 77%, the results demonstrate the effectiveness of our approach.