When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse

📅 2026-03-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the significant performance degradation of audio-visual speech recognition (AVSR) in real-world video conferencing scenarios, where transmission distortions and user-induced exaggerated articulatory behaviors—such as the Lombard effect—adversely impact accuracy. The authors present MLD-VC, the first multimodal dataset specifically designed for video conferencing, and conduct a systematic evaluation of AVSR model failure modes across mainstream platforms. Their analysis reveals that audio distribution shifts caused by speech enhancement algorithms are the primary cause of performance collapse, with these distortions exhibiting acoustic characteristics highly similar to those of the Lombard effect. Building on these insights, the work proposes a targeted optimization strategy through multimodal data collection, effect modeling, and model fine-tuning, achieving an average 17.5% reduction in character error rate (CER) across multiple video conferencing platforms and substantially improving AVSR robustness in practical settings.

📝 Abstract

Audio-Visual Speech Recognition (AVSR) has achieved remarkable progress in offline conditions, yet its robustness in real-world video conferencing (VC) remains largely unexplored. This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. To address this gap, we construct MLD-VC, the first multimodal dataset tailored for VC, comprising 31 speakers, 22.79 hours of audio-visual data, and explicit use of the Lombard effect to enhance human hyper-expression. Through comprehensive analysis, we find that speech enhancement algorithms are the primary source of distribution shift, which alters the first and second formants of audio. Interestingly, we find that the distribution shift induced by the Lombard effect closely resembles that introduced by speech enhancement, which explains why models trained on Lombard data exhibit greater robustness in VC. Fine-tuning AVSR models on MLD-VC mitigates this issue, achieving an average 17.5% reduction in CER across several VC platforms. Our findings and dataset provide a foundation for developing more robust and generalizable AVSR systems in real-world video conferencing. MLD-VC is available at https://huggingface.co/datasets/nccm2p2/MLD-VC.
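
The abstract reports its headline result in character error rate (CER). For readers unfamiliar with the metric, the following is a minimal sketch that assumes the standard definition: character-level Levenshtein distance divided by the reference length. The function names `edit_distance` and `cer` are illustrative and do not come from the paper, and the abstract does not state whether the 17.5% reduction is relative or absolute, so the sketch only computes the raw metric.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition CER can exceed 1.0 when the hypothesis contains many insertions, which is worth keeping in mind when averaging scores across platforms.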

Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Speech Recognition

Video Conferencing

Performance Degradation

Distribution Shift

Lombard Effect

Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-Visual Speech Recognition

Video Conferencing

Lombard Effect

Distribution Shift

Multimodal Dataset
