🤖 AI Summary
This work addresses zero-shot audio-visual speech recognition (AVSR): recognizing speech in a target language without any audio-visual training data for that language. We propose AV-Romanizer, a framework that learns language-agnostic joint audio-visual representations and directly predicts Romanized transcriptions of speech. To support this, we introduce MARC, a large-scale multilingual audio-visual Romanization corpus comprising 2,916 hours across 82 languages. We further design a unified fine-tuning architecture that end-to-end jointly optimizes audio-visual encoders with a large language model (LLM), while enabling mapping from Romanized outputs to native orthographies. Key contributions include: the first zero-shot AVSR method; a novel mechanism for learning language-agnostic phonetic representations; a multi-task adapter-based fine-tuning strategy; and the first large-scale multilingual Romanization benchmark for AVSR. Experiments demonstrate substantial improvements in zero-shot recognition performance on unseen languages, significantly expanding the linguistic coverage of AVSR.
📝 Abstract
We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, by leveraging the strong multilingual modeling capabilities of Large Language Models (LLMs), we propose converting the predicted Roman text into language-specific graphemes, forming the proposed Cascaded Zero-AVSR. Taking it a step further, we explore a unified Zero-AVSR approach by directly integrating the audio-visual speech representations encoded by the AV-Romanizer into the LLM. This is achieved through finetuning the adapter and the LLM using our proposed multi-task learning scheme. To capture the wide spectrum of phonetic and linguistic diversity, we also introduce a Multilingual Audio-Visual Romanized Corpus (MARC) consisting of 2,916 hours of audio-visual speech data across 82 languages, along with transcriptions in both language-specific graphemes and Roman text. Extensive analysis and experiments confirm that the proposed Zero-AVSR framework has the potential to expand language support beyond the languages seen during the training of the AV-Romanizer.