🤖 AI Summary
This work addresses the challenge of applying audio-visual speech recognition (AVSR) to low-resource languages, which typically lack annotated audio-visual corpora. The authors propose an approach that requires no real audio-visual data: using lip-syncing technology, they synthesize lip-motion videos from static face images paired with real speech, and use the resulting synthetic dataset to fine-tune a pre-trained AV-HuBERT model for multimodal speech recognition. Despite training on no real audio-visual recordings, the method nearly matches state-of-the-art results on a Catalan test set and significantly outperforms audio-only models. The system also retains its multimodal advantage in noisy conditions, effectively removing the dependency on annotated video corpora.
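The data-generation step is simple to sketch. Below is a minimal illustration of how such a pipeline might look, assuming a Wav2Lip-style lip-syncing tool driven from its command-line interface; the summary does not name the specific tool used, and the checkpoint name, directory layout, and round-robin pairing of faces with utterances here are all hypothetical:

```python
import subprocess
from pathlib import Path

# Hypothetical layout: one static face image per speaker and a pool of real
# speech recordings; each (image, audio) pair yields one synthetic clip.
FACES = Path("faces")        # e.g. faces/speaker_001.jpg
AUDIO = Path("speech")       # e.g. speech/utt_0001.wav
OUT = Path("synthetic_av")   # lip-synced talking-head videos land here
OUT.mkdir(exist_ok=True)

face_images = sorted(FACES.glob("*.jpg"))

for i, wav in enumerate(sorted(AUDIO.glob("*.wav"))):
    face = face_images[i % len(face_images)]  # round-robin over face images
    out = OUT / f"{wav.stem}.mp4"
    # Wav2Lip-style CLI (an assumption, not the paper's confirmed tooling):
    # animates the mouth region of a static image to match the input audio.
    subprocess.run(
        [
            "python", "inference.py",
            "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
            "--face", str(face),
            "--audio", str(wav),
            "--outfile", str(out),
        ],
        check=True,
    )
```

The resulting (video, audio, transcript) triples can then be fed into AV-HuBERT fine-tuning exactly as a real audio-visual corpus would be.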
📝 Abstract
Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions but remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. We first evaluate synthetic visual augmentation on Spanish benchmarks, then apply it to Catalan, a language with no annotated audiovisual corpora. We synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. On a manually annotated Catalan benchmark, our model achieves near state-of-the-art performance with far fewer parameters and much less training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise. Scalable synthetic video thus offers a viable substitute for real recordings in zero-AV-resource AVSR.
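The claim that multimodal advantages persist in noise is typically tested by corrupting the audio channel at controlled signal-to-noise ratios and re-scoring both models. The following is a minimal, self-contained sketch of the standard SNR-mixing step, not code from the paper; the random arrays stand in for real speech and noise recordings, and the evaluation hooks are left as comments:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    noise = np.resize(noise, speech.shape)        # loop/trim noise to speech length
    p_speech = float(np.mean(speech ** 2))
    p_noise = float(np.mean(noise ** 2)) + 1e-12  # guard against silent noise
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(16_000)   # stand-in for 1 s of 16 kHz speech
    noise = rng.standard_normal(16_000)    # stand-in for a noise recording
    for snr in (10, 5, 0, -5):
        noisy = mix_at_snr(speech, noise, snr)
        # In a real evaluation, decode `noisy` with both the AV model and the
        # audio-only baseline and compare word error rates at each SNR.
        achieved = 10 * np.log10(np.mean(speech**2) / np.mean((noisy - speech)**2))
        print(f"target {snr:>3} dB -> achieved {achieved:.2f} dB")
```

Because the visual stream is untouched by acoustic noise, the audio-visual model's word error rate is expected to degrade more gracefully than the audio-only baseline's as the SNR drops, which is the pattern the abstract reports.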