🤖 AI Summary
This study addresses the severe performance degradation of audio-visual speech recognition (AVSR) in real-world video conferencing, where transmission distortions and spontaneous hyper-expression by speakers, such as the Lombard effect, reduce accuracy. The authors present MLD-VC, the first multimodal dataset designed specifically for video conferencing (31 speakers, 22.79 hours of audio-visual data), and systematically evaluate how state-of-the-art AVSR models fail across mainstream platforms. Their analysis identifies speech enhancement algorithms as the primary source of distribution shift, which alters the first and second formants of the audio and closely resembles the shift induced by the Lombard effect. Building on this finding, they fine-tune AVSR models on MLD-VC, achieving an average 17.5% reduction in character error rate (CER) across several video conferencing platforms.
📝 Abstract
Audio-Visual Speech Recognition (AVSR) has achieved remarkable progress in offline conditions, yet its robustness in real-world video conferencing (VC) remains largely unexplored. This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. To address this gap, we construct MLD-VC, the first multimodal dataset tailored for VC, comprising 31 speakers and 22.79 hours of audio-visual data, with the Lombard effect explicitly used to enhance human hyper-expression. Through comprehensive analysis, we find that speech enhancement algorithms are the primary source of distribution shift, altering the first and second formants of the audio. Interestingly, the distribution shift induced by the Lombard effect closely resembles that introduced by speech enhancement, which explains why models trained on Lombard data exhibit greater robustness in VC. Fine-tuning AVSR models on MLD-VC mitigates this issue, achieving an average 17.5% reduction in CER across several VC platforms. Our findings and dataset provide a foundation for developing more robust and generalizable AVSR systems in real-world video conferencing. MLD-VC is available at https://huggingface.co/datasets/nccm2p2/MLD-VC.
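For readers unfamiliar with the metric, the following is a minimal sketch of how CER is conventionally computed: the character-level edit distance between the reference and the recognized transcript, divided by the reference length. It is not taken from the paper; the function names and the example numbers are illustrative assumptions. The abstract does not state whether the 17.5% reduction is relative or absolute, so the sketch shows both readings on made-up values.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed (hypothetical helper)."""
    return (baseline - improved) / baseline


def absolute_reduction(baseline: float, improved: float) -> float:
    """Difference between the two error rates (hypothetical helper)."""
    return baseline - improved


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    baseline_cer, finetuned_cer = 0.20, 0.15
    print(cer("abcd", "abxd"))
    print(relative_reduction(baseline_cer, finetuned_cer))
    print(absolute_reduction(baseline_cer, finetuned_cer))
```

With the illustrative values above, the same improvement reads as a 25% relative reduction or a 5-point absolute reduction, which is why the distinction matters when comparing reported CER gains.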