🤖 AI Summary
To address error propagation and temporal misalignment caused by text-based intermediate representations in Cued Speech video-to-speech generation, this paper proposes the first unified end-to-end framework that maps CS videos directly to speech. Methodologically, it introduces a unified recognition-and-generation architecture incorporating a fine-grained semantic alignment pool, a VisioPhonetic adapter, and a pose-aware visual processor to jointly model lip movements, hand coding, and speech acoustic features. Multi-task joint training and cross-task representation alignment are employed, augmented with pose-guided spatiotemporal feature extraction. Evaluated on a newly constructed Chinese Cued Speech video-to-speech dataset, the framework reduces Word Error Rate by 78.3% and improves lip–speech synchronization by 32% relative to direct single-model CSV2S generation, while avoiding the error propagation of cascaded "video → text → speech" pipelines.
📝 Abstract
Cued Speech (CS) enhances lipreading through hand coding, providing precise speech perception support for the hearing-impaired. The CS Video-to-Speech generation (CSV2S) task aims to convert the CS visual expressions (CS videos) of hearing-impaired individuals into comprehensible speech signals. Direct generation of speech from CS video (called single CSV2S) yields poor performance due to insufficient CS data. Current research mostly focuses on CS Recognition (CSR), which converts video content into linguistic text. Building on this, one straightforward approach to CSV2S is to cascade CSR with a Text-to-Speech system. This combined architecture relies on text as an intermediate medium for stepwise cross-modal alignment, which may lead to error propagation and temporal misalignment between speech and video dynamics. To address these challenges, we propose a novel approach that directly generates speech from CS videos without relying on intermediate text. Building upon this, we propose UniCUE, the first unified framework for CSV2S, whose core innovation lies in integrating the CSR task to provide fine-grained visual-semantic information that facilitates speech generation from CS videos. More precisely, (1) a novel fine-grained semantic alignment pool ensures precise mapping between visual features and speech contents; (2) a VisioPhonetic adapter bridges cross-task representations, ensuring seamless compatibility between the two distinct tasks (i.e., CSV2S and CSR); (3) a pose-aware visual processor enhances fine-grained spatiotemporal correlations between lip and hand movements in CS video. Experiments on our newly established Chinese CS dataset (14 cuers: 8 hearing-impaired and 6 normal-hearing) show that UniCUE significantly reduces Word Error Rate by 78.3% and improves lip-speech synchronization by 32% compared to single CSV2S.
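The abstract describes the VisioPhonetic adapter only at a high level: it maps recognition-oriented (CSR) features into a representation usable by the speech generator. The sketch below is purely illustrative, not the paper's implementation; the dimensions, the function names, and the choice of a single linear projection are all assumptions made for the example.

```python
import random

random.seed(0)

def linear(x, W, b):
    # y = W @ x + b, as a plain-Python matrix-vector product
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

# Hypothetical feature sizes: CSR (text-oriented) vs. speech-oriented space
D_CSR, D_SPEECH = 4, 6

# The adapter is sketched here as one learned projection; the paper does
# not specify its internals in the abstract, so this is an assumption.
W = [[random.gauss(0.0, 0.1) for _ in range(D_CSR)] for _ in range(D_SPEECH)]
b = [0.0] * D_SPEECH

def visio_phonetic_adapter(csr_feat):
    """Map one frame's CSR representation into the speech-generation space."""
    return linear(csr_feat, W, b)

# A dummy per-frame recognition feature from a CS video
frame_feat = [0.2, -0.1, 0.5, 0.3]
speech_feat = visio_phonetic_adapter(frame_feat)
print(len(speech_feat))  # 6
```

The point of such an adapter is that the recognition branch and the speech branch can share visual features while each task keeps its own output space, which is what lets the unified framework train both tasks jointly.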