Evaluation of End-to-End Continuous Spanish Lipreading in Different Data Conditions

📅 2025-02-01
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the limited robustness of end-to-end continuous Spanish visual speech recognition (lip-reading) under data scarcity, acoustic noise, and cross-speaker variability. To this end, we propose the first end-to-end continuous Spanish lip-reading system, featuring a low-resource-adapted temporal modeling strategy based on a Transformer architecture. Our approach integrates joint CTC–attention decoding, visual feature enhancement, and synthetic data augmentation. We conduct the first systematic evaluation of Spanish lip-reading generalization across diverse realistic conditions, including visual ambiguity, inter-speaker articulatory variation, and silent frames. On the Spanish-LRS benchmark, our model achieves a word error rate (WER) of 38.2%, a 9.7-point absolute reduction over the baseline. Notably, it maintains structural modeling capability for speech streams even in few-shot settings, demonstrating strong adaptability to challenging visual conditions.
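The headline result above is a word error rate (WER) of 38.2%. As a reference for readers unfamiliar with the metric, WER is the word-level edit distance between the recognizer's hypothesis and the reference transcript, divided by the reference length. A minimal sketch (not the paper's code; the function name and example sentences are illustrative):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of three reference words -> WER of 1/3
print(wer("el gato negro", "el gato"))
```

Because WER counts insertions as well as deletions and substitutions, it can exceed 100% on very poor hypotheses, which is why absolute point reductions (here 9.7) are the usual way to report improvements.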

Problem

Research questions and friction points this paper addresses.

- Visual Speech Recognition
- Spanish Lip-Reading
- Accuracy Under Various Conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Spanish Lip-Reading
- Hybrid CTC/Attention Architecture
- State-of-the-Art Performance