🤖 AI Summary
This work addresses the challenge of applying audio-visual speech recognition (AVSR) to low-resource languages, which typically lack annotated audio-visual corpora. The authors propose an approach that requires no real audio-visual data: using lip-syncing technology, they synthesize lip-motion videos from static face images paired with real speech, and use the resulting synthetic dataset to fine-tune a pre-trained AV-HuBERT model for multimodal speech recognition. Despite training on no real audio-visual recordings, the method nearly matches state-of-the-art results on a Catalan test set and significantly outperforms audio-only models. The system also retains its multimodal advantage in noisy conditions, effectively removing the dependency on annotated video corpora.
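The data-generation step is simple to sketch. Below is a minimal illustration of how such a pipeline might look, assuming a Wav2Lip-style lip-syncing tool driven from its command-line interface; the summary does not name the specific tool used, and the checkpoint name, directory layout, and round-robin pairing of faces with utterances here are all hypothetical:

```python
import subprocess
from pathlib import Path

# Hypothetical layout: one static face image per speaker and a pool of real
# speech recordings; each (image, audio) pair yields one synthetic clip.
FACES = Path("faces")        # e.g. faces/speaker_001.jpg
AUDIO = Path("speech")       # e.g. speech/utt_0001.wav
OUT = Path("synthetic_av")   # lip-synced talking-head videos land here
OUT.mkdir(exist_ok=True)

face_images = sorted(FACES.glob("*.jpg"))

for i, wav in enumerate(sorted(AUDIO.glob("*.wav"))):
    face = face_images[i % len(face_images)]  # round-robin over face images
    out = OUT / f"{wav.stem}.mp4"
    # Wav2Lip-style CLI (an assumption, not the paper's confirmed tooling):
    # animates the mouth region of a static image to match the input audio.
    subprocess.run(
        [
            "python", "inference.py",
            "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
            "--face", str(face),
            "--audio", str(wav),
            "--outfile", str(out),
        ],
        check=True,
    )
```

The resulting (video, audio, transcript) triples can then be fed into AV-HuBERT fine-tuning exactly as a real audio-visual corpus would be.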
📝 Abstract
Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions but remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. We first evaluate synthetic visual augmentation on Spanish benchmarks, then apply it to Catalan, a language with no annotated audiovisual corpora. We synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. On a manually annotated Catalan benchmark, our model achieves near state-of-the-art performance with far fewer parameters and much less training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise. Scalable synthetic video thus offers a viable substitute for real recordings in zero-AV-resource AVSR.
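The claim that multimodal advantages persist in noise is typically tested by corrupting the audio channel at controlled signal-to-noise ratios and re-scoring both models. The following is a minimal, self-contained sketch of the standard SNR-mixing step, not code from the paper; the random arrays stand in for real speech and noise recordings, and the evaluation hooks are left as comments:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    noise = np.resize(noise, speech.shape)        # loop/trim noise to speech length
    p_speech = float(np.mean(speech ** 2))
    p_noise = float(np.mean(noise ** 2)) + 1e-12  # guard against silent noise
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    speech = rng.standard_normal(16_000)   # stand-in for 1 s of 16 kHz speech
    noise = rng.standard_normal(16_000)    # stand-in for a noise recording
    for snr in (10, 5, 0, -5):
        noisy = mix_at_snr(speech, noise, snr)
        # In a real evaluation, decode `noisy` with both the AV model and the
        # audio-only baseline and compare word error rates at each SNR.
        achieved = 10 * np.log10(np.mean(speech**2) / np.mean((noisy - speech)**2))
        print(f"target {snr:>3} dB -> achieved {achieved:.2f} dB")
```

Because the visual stream is untouched by acoustic noise, the audio-visual model's word error rate is expected to degrade more gracefully than the audio-only baseline's as the SNR drops, which is the pattern the abstract reports.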