🤖 AI Summary
Multilingual self-supervised speech models exhibit substantial performance degradation relative to their monolingual counterparts, particularly in bilingual or low-resource multilingual settings. To address this, we propose an audio-visual joint modeling framework that incorporates limited visual grounding into mainstream self-supervised learning (SSL) architectures (e.g., wav2vec 2.0 and HuBERT), enabling cross-modal representation alignment. Our key innovation is to use lightweight visual signals to guide speech representation learning, mitigating cross-lingual interference and encouraging language-invariant features. On zero-shot phoneme discrimination in bilingual scenarios, our method reduces the performance gap between multilingual and monolingual models from 31.5% to 8.04%. This work establishes an efficient and viable paradigm for low-resource multilingual speech representation learning.
📝 Abstract
Self-supervised learning (SSL) has made significant advances in speech representation learning. Models like wav2vec 2.0 and HuBERT have achieved state-of-the-art results in tasks such as speech recognition, particularly in monolingual settings. However, multilingual SSL models tend to underperform their monolingual counterparts on each individual language, especially in multilingual scenarios with few languages such as the bilingual setting. In this work, we investigate a novel approach to reduce this performance gap by introducing limited visual grounding into bilingual speech SSL models. Our results show that visual grounding benefits both monolingual and bilingual models, with especially pronounced gains for the latter, reducing the multilingual performance gap on zero-shot phonetic discrimination from 31.5% for audio-only models to 8.04% with grounding.
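The headline numbers (31.5% vs. 8.04%) can be read as a relative performance gap between the bilingual and monolingual models on the same phonetic-discrimination metric. A minimal sketch of that computation, assuming the gap is defined as the relative increase in error of the multilingual model over its monolingual counterpart (the paper's exact formula is not given here, and the error rates below are hypothetical):

```python
def multilingual_gap(mono_error: float, multi_error: float) -> float:
    """Relative gap (%) of a multilingual model's error over the
    monolingual baseline, e.g. on a zero-shot ABX phoneme-discrimination
    task. This relative-error definition is an assumption for
    illustration, not necessarily the paper's exact metric."""
    return (multi_error - mono_error) / mono_error * 100.0

# Hypothetical error rates (illustrative only, not from the paper):
# a monolingual model at 5.0% error and a bilingual model at 6.575%
# error would correspond to a 31.5% relative gap.
print(multilingual_gap(5.0, 6.575))
```

Under this reading, visual grounding shrinking the gap from 31.5% to 8.04% means the bilingual model's error moves much closer to the monolingual baseline, rather than the absolute error dropping by that amount.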