When AVSR Meets Video Conferencing: Dataset, Degradation, and the Hidden Mechanism Behind Performance Collapse

📅 2026-03-24

🤖 AI Summary
This study examines why audio-visual speech recognition (AVSR) degrades sharply in real-world video conferencing, where transmission distortions and exaggerated articulation by speakers, such as the Lombard effect, reduce accuracy. The authors present MLD-VC, the first multimodal dataset designed specifically for video conferencing, and systematically evaluate how AVSR models fail across mainstream platforms. Their analysis identifies the audio distribution shift introduced by speech enhancement algorithms as the primary cause of the performance collapse, and shows that this shift closely resembles the acoustic changes produced by the Lombard effect. Building on these insights, the authors combine multimodal data collection, effect modeling, and model fine-tuning, achieving an average 17.5% reduction in character error rate (CER) across multiple video conferencing platforms.

📝 Abstract
Audio-Visual Speech Recognition (AVSR) has achieved remarkable progress in offline conditions, yet its robustness in real-world video conferencing (VC) remains largely unexplored. This paper presents the first systematic evaluation of state-of-the-art AVSR models across mainstream VC platforms, revealing severe performance degradation caused by transmission distortions and spontaneous human hyper-expression. To address this gap, we construct MLD-VC, the first multimodal dataset tailored for VC, comprising 31 speakers, 22.79 hours of audio-visual data, and explicit use of the Lombard effect to enhance human hyper-expression. Through comprehensive analysis, we find that speech enhancement algorithms are the primary source of distribution shift, which alters the first and second formants of audio. Interestingly, we find that the distribution shift induced by the Lombard effect closely resembles that introduced by speech enhancement, which explains why models trained on Lombard data exhibit greater robustness in VC. Fine-tuning AVSR models on MLD-VC mitigates this issue, achieving an average 17.5% reduction in CER across several VC platforms. Our findings and dataset provide a foundation for developing more robust and generalizable AVSR systems in real-world video conferencing. MLD-VC is available at https://huggingface.co/datasets/nccm2p2/MLD-VC.
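For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate and a reduction between two CER values could be computed. It is not the authors' evaluation code. It assumes CER is the character-level Levenshtein distance divided by the reference length, and it assumes the reported reduction is relative; the abstract does not say whether 17.5% is relative or absolute. The function names and the example numbers are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative reduction between two error rates (assumed interpretation)."""
    return (before - after) / before


# Hypothetical numbers for illustration only, not results from the paper.
print(cer("abcd", "abxd"))
print(relative_reduction(0.40, 0.33))
```

Under the relative interpretation, a drop from a hypothetical CER of 0.40 to 0.33 corresponds to a 17.5% reduction; under an absolute interpretation the same figure would mean a drop of 17.5 percentage points.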
Problem

Research questions and friction points this paper is trying to address.

Audio-Visual Speech Recognition
Video Conferencing
Performance Degradation
Distribution Shift
Lombard Effect
Innovation

Methods, ideas, or system contributions that make the work stand out.

Audio-Visual Speech Recognition
Video Conferencing
Lombard Effect
Distribution Shift
Multimodal Dataset
Yihuan Huang
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education; School of Cyber Science and Engineering, Wuhan University
Jun Xue
School of Cyber Science and Engineering, Wuhan University
Liu Jiajun
School of Cyber Science and Engineering, Wuhan University
Daixian Li
School of Cyber Science and Engineering, Wuhan University
Tong Zhang
Professor of GIS/Remote Sensing, Wuhan University
GeoAI, machine learning, transport geography
Zhuolin Yi
School of Cyber Science and Engineering, Wuhan University
Yanzhen Ren
Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education; School of Cyber Science and Engineering, Wuhan University
Kai Li
Tsinghua University