Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations

📅 2025-03-08

📈 Citations: 0

✨ Influential: 0

career value

188K/year

🤖 AI Summary

This work addresses zero-shot audio-visual speech recognition (AVSR): recognizing speech in a target language without any audio-visual training data for that language. We propose AV-Romanizer, a framework that learns language-agnostic joint audio-visual representations and directly predicts Romanized transcriptions of speech. To support this, we introduce MARC, a large-scale multilingual audio-visual Romanization corpus comprising 2,916 hours across 82 languages. We further design a unified fine-tuning architecture that end-to-end jointly optimizes audio-visual encoders with a large language model (LLM), while enabling mapping from Romanized outputs to native orthographies. Key contributions include: the first zero-shot AVSR method; a novel mechanism for learning language-agnostic phonetic representations; a multi-task adapter-based fine-tuning strategy; and the first large-scale multilingual Romanization benchmark for AVSR. Experiments demonstrate substantial improvements in zero-shot recognition performance on unseen languages, significantly expanding the linguistic coverage of AVSR.

Technology Category

Application Category

📝 Abstract

We explore a novel zero-shot Audio-Visual Speech Recognition (AVSR) framework, dubbed Zero-AVSR, which enables speech recognition in target languages without requiring any audio-visual speech data in those languages. Specifically, we introduce the Audio-Visual Speech Romanizer (AV-Romanizer), which learns language-agnostic speech representations by predicting Roman text. Then, by leveraging the strong multilingual modeling capabilities of Large Language Models (LLMs), we propose converting the predicted Roman text into language-specific graphemes, forming the proposed Cascaded Zero-AVSR. Taking it a step further, we explore a unified Zero-AVSR approach by directly integrating the audio-visual speech representations encoded by the AV-Romanizer into the LLM. This is achieved through finetuning the adapter and the LLM using our proposed multi-task learning scheme. To capture the wide spectrum of phonetic and linguistic diversity, we also introduce a Multilingual Audio-Visual Romanized Corpus (MARC) consisting of 2,916 hours of audio-visual speech data across 82 languages, along with transcriptions in both language-specific graphemes and Roman text. Extensive analysis and experiments confirm that the proposed Zero-AVSR framework has the potential to expand language support beyond the languages seen during the training of the AV-Romanizer.

Problem

Research questions and friction points this paper is trying to address.

Zero-shot speech recognition without target language data

Language-agnostic speech representations via Roman text prediction

Multilingual AVSR using LLMs and a diverse speech corpus

Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-agnostic speech representations via AV-Romanizer

Cascaded Zero-AVSR using LLMs for multilingual conversion

Unified Zero-AVSR with multi-task learning integration

🔎 Similar Papers

Large Language Models Are Strong Audio-Visual Speech Recognition Learners