🤖 AI Summary
Real-world audio-visual speech recognition (AVSR) suffers significant performance degradation when inputs are corrupted by practical degradations—e.g., occluded lips or noisy audio. To address this, we propose CAV2vec, a self-supervised framework for robust AVSR. Our method introduces two key ideas: (1) teacher-student self-distillation with a corrupted prediction task, in which the student predicts clean targets generated by the teacher from corrupted input frames; and (2) a unimodal multi-task learning strategy—predicting clean video targets from corrupted audio and clean audio targets from corrupted video—that distills cross-modal knowledge, aligns the corrupted modalities, and mitigates dispersion in the representation space. Evaluated on standard robust AVSR benchmarks, CAV2vec achieves substantial improvements in recognition accuracy across diverse audio and video corruptions, demonstrating strong generalization capability. The implementation is publicly available.
📝 Abstract
Audio-visual speech recognition (AVSR) incorporates auditory and visual modalities to improve recognition accuracy, particularly in noisy environments where audio-only speech systems are insufficient. While previous research has largely addressed audio disruptions, few studies have dealt with visual corruptions, e.g., lip occlusions or blurred videos, which are also detrimental. To address this real-world challenge, we propose CAV2vec, a novel self-supervised speech representation learning framework designed specifically to handle joint audio-visual corruption. CAV2vec employs a self-distillation approach with a corrupted prediction task, where the student model learns to predict clean targets, generated by the teacher model, from corrupted input frames. Specifically, we propose unimodal multi-task learning, which distills cross-modal knowledge and aligns the corrupted modalities by predicting clean audio targets from corrupted video and clean video targets from corrupted audio. This strategy mitigates the dispersion in the representation space caused by corrupted modalities, leading to more reliable and robust audio-visual fusion. Our experiments on robust AVSR benchmarks demonstrate that the corrupted representation learning method significantly enhances recognition accuracy across generalized environments involving various types of corruption. Our code is available at https://github.com/sungnyun/cav2vec.
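To make the corrupted prediction objective concrete, the following is a minimal toy sketch (not the paper's implementation) of the cross-modal losses described above: a frozen teacher encodes clean audio and video into targets, while a student encodes corrupted inputs and is penalized for failing to predict the opposite modality's clean targets. The linear `encoder`, feature sizes, and corruption simulations are all illustrative assumptions; the real model uses transformer encoders, EMA teacher updates, and learned projection heads.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W):
    # Toy linear "encoder"; stands in for the transformer encoders in CAV2vec.
    return np.tanh(x @ W)

T, D = 4, 8  # frames and feature dim (illustrative sizes, not from the paper)
W_teacher = rng.normal(size=(D, D))
W_student = W_teacher.copy()  # student starts from teacher weights (EMA pair)

clean_audio = rng.normal(size=(T, D))
clean_video = rng.normal(size=(T, D))
# Simulated corruptions: occlusion-style masking for video, additive noise for audio.
corrupt_video = clean_video * (rng.random((T, D)) > 0.3)
corrupt_audio = clean_audio + 0.5 * rng.normal(size=(T, D))

# Teacher produces clean targets (no gradients flow here in the real setup).
tgt_audio = encoder(clean_audio, W_teacher)
tgt_video = encoder(clean_video, W_teacher)

# Student sees corrupted inputs; cross-modal corrupted prediction:
#   corrupted video -> clean audio targets, corrupted audio -> clean video targets.
pred_from_video = encoder(corrupt_video, W_student)
pred_from_audio = encoder(corrupt_audio, W_student)

loss_v2a = np.mean((pred_from_video - tgt_audio) ** 2)
loss_a2v = np.mean((pred_from_audio - tgt_video) ** 2)
loss = loss_v2a + loss_a2v  # combined unimodal multi-task objective
print(f"v2a={loss_v2a:.4f}  a2v={loss_a2v:.4f}  total={loss:.4f}")
```

Minimizing both terms pulls the student's representations of each corrupted modality toward the teacher's clean targets of the other, which is the alignment effect the abstract attributes to the unimodal multi-task losses.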