Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data

📅 2026-03-09
🤖 AI Summary
This work addresses the challenge of applying audio-visual speech recognition (AVSR) to low-resource languages, which typically lack annotated audio-visual corpora. The authors propose a novel approach that requires no real audio-visual data: by leveraging lip-syncing technology, they synthesize lip-motion videos from static face images paired with real speech, thereby constructing a synthetic training dataset. This synthetic dataset is then used to fine-tune a pre-trained AV-HuBERT model for multimodal speech recognition. Remarkably, this method achieves strong AVSR performance with zero real audio-visual resources, nearly matching state-of-the-art results on a Catalan test set and significantly outperforming audio-only models. Furthermore, the system retains its multimodal advantage even in noisy conditions, effectively eliminating the dependency on annotated video corpora.

📝 Abstract
Audiovisual speech recognition (AVSR) combines acoustic and visual cues to improve transcription robustness under challenging conditions but remains out of reach for most under-resourced languages due to the lack of labeled video corpora for training. We propose a zero-AV-resource AVSR framework that relies on synthetic visual streams generated by lip-syncing static facial images with real audio. We first evaluate synthetic visual augmentation on Spanish benchmarks, then apply it to Catalan, a language with no annotated audiovisual corpora. We synthesize over 700 hours of talking-head video and fine-tune a pre-trained AV-HuBERT model. On a manually annotated Catalan benchmark, our model achieves near state-of-the-art performance with far fewer parameters and less training data, outperforms an identically trained audio-only baseline, and preserves multimodal advantages in noise. Scalable synthetic video thus offers a viable substitute for real recordings in zero-AV-resource AVSR.
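The data-construction step the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' code: `lip_sync` is a stub standing in for whatever lip-syncing generator the paper uses (the abstract does not name one), and the 25 fps frame rate is an assumption based on common AVSR preprocessing. Only the audio-to-frames pairing logic is shown.

```python
from dataclasses import dataclass

FPS = 25  # assumed video frame rate for the synthetic visual stream

@dataclass
class AVSample:
    audio: list          # waveform samples (placeholder type)
    frames: list         # synthetic lip-region video frames
    transcript: str

def lip_sync(face_image, audio, sample_rate=16000):
    """Stub for a lip-syncing generator.

    A real model would animate `face_image` so its mouth movements
    match `audio`; here we emit one placeholder frame per video-frame
    interval, which is enough to show the data flow.
    """
    n_frames = max(1, round(len(audio) / sample_rate * FPS))
    return [face_image] * n_frames

def build_synthetic_corpus(speech_corpus, face_images):
    """Pair each (audio, transcript) item with a static face image and
    synthesize a matching visual stream -- no real video is needed."""
    corpus = []
    for i, (audio, transcript) in enumerate(speech_corpus):
        face = face_images[i % len(face_images)]  # round-robin over faces
        corpus.append(AVSample(audio, lip_sync(face, audio), transcript))
    return corpus

# one second of dummy 16 kHz audio yields 25 synthetic frames
demo = build_synthetic_corpus([([0.0] * 16000, "hola")], ["face_0"])
print(len(demo[0].frames))  # 25
```

The resulting `AVSample` triples (audio, synthetic frames, transcript) are what would then be fed to AV-HuBERT fine-tuning in place of real audiovisual recordings.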
Problem

Research questions and friction points this paper is trying to address.

audiovisual speech recognition
zero-resource
low-resource languages
synthetic visual data
labeled video corpora
Innovation

Methods, ideas, or system contributions that make the work stand out.

synthetic visual data
zero-AV-resource
audiovisual speech recognition
lip-syncing
AV-HuBERT
Pol Buitrago
Barcelona Supercomputing Center (BSC), Spain; Universitat Politècnica de Catalunya (UPC), Spain
Pol Gàlvez
Barcelona Supercomputing Center (BSC), Spain; Universitat Politècnica de Catalunya (UPC), Spain
Oriol Pareras
Research Engineer, Barcelona Supercomputing Center
Natural Language Processing · Multimodality · Deep Learning
Javier Hernando
Professor of Electronic Engineering, Universitat Politècnica de Catalunya
Speech Processing · Biometrics