UniCUE: Unified Recognition and Generation Framework for Chinese Cued Speech Video-to-Speech Generation

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the error propagation and temporal misalignment caused by text-based intermediate representations in Cued Speech video-to-speech generation, this paper proposes the first end-to-end vision-to-speech cross-modal mapping framework for this task. Methodologically, it introduces a unified recognition-and-generation architecture incorporating a fine-grained semantic alignment pooling module, a VisioPhonetic adapter, and a pose-aware visual processor to jointly model lip movements, manual cues, and speech acoustic features. Multi-task joint training and cross-task representation alignment are employed, augmented with pose-guided spatiotemporal feature extraction. Evaluated on a newly constructed Chinese Cued Speech video-to-speech dataset, the framework reduces Word Error Rate by 78.3% and improves lip-speech synchronization by 32% compared to direct single-step CSV2S generation, avoiding the error propagation of cascaded "recognition → text → speech" pipelines.

📝 Abstract
Cued Speech (CS) enhances lipreading through hand coding, providing precise speech perception support for the hearing-impaired. The CS Video-to-Speech generation (CSV2S) task aims to convert the CS visual expressions (CS videos) of hearing-impaired individuals into comprehensible speech signals. Direct generation of speech from CS video (called single CSV2S) yields poor performance due to insufficient CS data. Current research mostly focuses on CS Recognition (CSR), which converts video content into linguistic text. Given this, one straightforward approach to CSV2S is to combine CSR with a Text-to-Speech system. This cascaded architecture relies on text as an intermediate medium for stepwise cross-modal alignment, which may lead to error propagation and temporal misalignment between speech and video dynamics. To address these challenges, we propose a novel approach that directly generates speech from CS videos without relying on intermediate text. Building upon this, we propose UniCUE, the first unified framework for CSV2S, whose core innovation lies in the integration of the CSR task, which provides fine-grained visual-semantic information to facilitate speech generation from CS videos. More precisely, we introduce (1) a novel fine-grained semantic alignment pool to ensure precise mapping between visual features and speech content; (2) a VisioPhonetic adapter to bridge cross-task representations, ensuring seamless compatibility between the two distinct tasks (i.e., CSV2S and CSR); and (3) a pose-aware visual processor to enhance fine-grained spatiotemporal correlations between lip and hand movements in CS videos. Experiments on our newly established Chinese CS dataset (14 cuers: 8 hearing-impaired and 6 normal-hearing) show that UniCUE significantly reduces Word Error Rate by 78.3% and improves lip-speech synchronization by 32% compared to single CSV2S.
Problem

Research questions and friction points this paper is trying to address.

Direct speech generation from Cued Speech videos lacks accuracy due to limited data.
Existing methods rely on error-prone intermediate text conversion for speech generation.
UniCUE integrates recognition and generation to improve visual-speech alignment and accuracy.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-grained semantic alignment for visual-speech mapping
VisioPhonetic adapter bridges cross-task representations
Pose-aware processor enhances lip-hand movement correlation
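The page describes the VisioPhonetic adapter only at a high level. As a purely illustrative sketch, not the paper's implementation (the dimensions, variable names, and the linear-plus-tanh form are all our assumptions), a cross-task adapter can be thought of as a small learned projection that maps per-frame features from the recognition (CSR) space into the speech-generation space, so the two tasks can share one visual encoder:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only, not from the paper).
T, d_rec, d_gen = 10, 64, 80

# Per-frame visual features as produced by a recognition (CSR) branch.
rec_feats = rng.standard_normal((T, d_rec))

# Minimal adapter: a learned linear map plus a bounded nonlinearity
# that projects recognition-space features into the generation space.
W = rng.standard_normal((d_rec, d_gen)) * 0.1
b = np.zeros(d_gen)

def adapter(x):
    # x: (T, d_rec) -> (T, d_gen)
    return np.tanh(x @ W + b)

gen_feats = adapter(rec_feats)
print(gen_feats.shape)  # (10, 80)
```

In practice such an adapter would be trained jointly with both task heads; this toy version only shows the shape of the mapping, one generation-space vector per video frame.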
Jinting Wang
Central University of Finance and Economics
Shan Yang
Tencent AI Lab, China
Li Liu
The Hong Kong University of Science and Technology (Guangzhou), China