Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation

📅 2025-01-23
📈 Citations: 1
Influential: 0
🤖 AI Summary
Real-world audio-visual speech recognition (AVSR) degrades significantly when inputs are corrupted in practice, e.g., by occluded lips or noisy audio. To address this, the paper proposes CAV2vec, a self-supervised framework for robust AVSR. The method combines two key components: (1) teacher-student self-distillation with a corrupted prediction task, in which the student predicts clean targets generated by the teacher from corrupted input frames; and (2) unimodal multi-task learning that predicts clean audio targets from corrupted video and clean video targets from corrupted audio, aligning the corrupted modalities and mitigating dispersion in the representation space. Evaluated on standard robust AVSR benchmarks, CAV2vec substantially improves recognition accuracy across diverse audio and video corruptions, demonstrating strong generalization. The implementation is publicly available.

📝 Abstract
Audio-visual speech recognition (AVSR) incorporates auditory and visual modalities to improve recognition accuracy, particularly in noisy environments where audio-only speech systems are insufficient. While previous research has largely addressed audio disruptions, few studies have dealt with visual corruptions, e.g., lip occlusions or blurred videos, which are also detrimental. To address this real-world challenge, we propose CAV2vec, a novel self-supervised speech representation learning framework designed to handle audio-visual joint corruption. CAV2vec employs a self-distillation approach with a corrupted prediction task, where the student model learns to predict clean targets, generated by the teacher model, from corrupted input frames. Specifically, we propose unimodal multi-task learning, which distills cross-modal knowledge and aligns the corrupted modalities by predicting clean audio targets from corrupted videos, and clean video targets from corrupted audios. This strategy mitigates the dispersion in the representation space caused by corrupted modalities, leading to more reliable and robust audio-visual fusion. Our experiments on robust AVSR benchmarks demonstrate that the corrupted representation learning method significantly enhances recognition accuracy across generalized environments involving various types of corruption. Our code is available at https://github.com/sungnyun/cav2vec.
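The self-distillation objective described in the abstract can be sketched as a toy example. This is not the authors' implementation: the linear `encode` stands in for the paper's transformer encoders, and the corruption models (additive noise for audio, zeroed frames for occluded video), weight names, and loss weighting are all illustrative assumptions. It shows the structure of the two loss groups: intra-modal corrupted prediction (corrupted input regresses onto the clean same-modality teacher target) and the unimodal multi-task cross-modal terms (corrupted video onto clean audio targets, corrupted audio onto clean video targets).

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frames, W):
    """Toy frame encoder: a linear projection standing in for a transformer."""
    return frames @ W

def ema_update(teacher_W, student_W, decay=0.999):
    """Teacher weights track the student via an exponential moving average,
    as is typical in self-distillation frameworks."""
    return decay * teacher_W + (1.0 - decay) * student_W

def corrupted_prediction_losses(audio, video, occlude_mask,
                                Wa_s, Wv_s, Wa_t, Wv_t, noise_scale=0.5):
    # Teacher encodes CLEAN inputs to produce regression targets (no gradient).
    tgt_a = encode(audio, Wa_t)
    tgt_v = encode(video, Wv_t)

    # Student sees CORRUPTED inputs: noisy audio and occluded lip frames.
    noisy_audio = audio + noise_scale * rng.standard_normal(audio.shape)
    occluded_video = np.where(occlude_mask[:, None], 0.0, video)
    za = encode(noisy_audio, Wa_s)
    zv = encode(occluded_video, Wv_s)

    def mse(pred, target):
        return float(np.mean((pred - target) ** 2))

    # Intra-modal corrupted prediction: corrupted input -> clean same-modal target.
    loss_intra = mse(za, tgt_a) + mse(zv, tgt_v)
    # Unimodal multi-task cross-modal prediction: corrupted video -> clean audio
    # target, corrupted audio -> clean video target (aligns the modalities).
    loss_cross = mse(zv, tgt_a) + mse(za, tgt_v)
    return loss_intra, loss_cross

# Minimal usage with random data: 8 frames, 4-dim features.
T, D = 8, 4
audio = rng.standard_normal((T, D))
video = rng.standard_normal((T, D))
occlude = rng.random(T) < 0.3           # ~30% of video frames occluded
Wa_s, Wv_s = rng.standard_normal((D, D)), rng.standard_normal((D, D))
Wa_t, Wv_t = rng.standard_normal((D, D)), rng.standard_normal((D, D))
intra, cross = corrupted_prediction_losses(audio, video, occlude,
                                           Wa_s, Wv_s, Wa_t, Wv_t)
```

In the actual framework the student would minimize a weighted sum of these terms and the teacher would be updated by EMA (as in `ema_update`) rather than trained directly.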
Problem

Research questions and friction points this paper is trying to address.

Handling audio-visual joint corruption in speech recognition
Improving robustness against visual corruptions like lip occlusions
Aligning corrupted modalities for reliable audio-visual fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-distillation with corrupted prediction task
Unimodal multi-task learning for cross-modal knowledge
Aligns corrupted modalities via clean target prediction