🤖 AI Summary
Multilingual self-supervised speech models exhibit substantial performance degradation relative to their monolingual counterparts, particularly in bilingual or low-resource multilingual settings. To address this, we propose an audio-visual joint modeling framework that incorporates limited visual grounding into mainstream self-supervised learning (SSL) architectures (e.g., wav2vec 2.0 and HuBERT), enabling cross-modal representation alignment. Our key innovation is to use lightweight visual signals to guide speech representation learning, mitigating cross-lingual interference and encouraging language-invariant features. On zero-shot phoneme discrimination in bilingual scenarios, our method reduces the performance gap between multilingual and monolingual models from 31.5% to 8.04%. This work establishes an efficient and viable paradigm for low-resource multilingual speech representation learning.
📝 Abstract
Self-supervised learning (SSL) has made significant advances in speech representation learning. Models like wav2vec 2.0 and HuBERT have achieved state-of-the-art results in tasks such as speech recognition, particularly in monolingual settings. However, multilingual SSL models tend to underperform their monolingual counterparts on each individual language, especially in multilingual scenarios with few languages such as the bilingual setting. In this work, we investigate a novel approach to reduce this performance gap by introducing limited visual grounding into bilingual speech SSL models. Our results show that visual grounding benefits both monolingual and bilingual models, with especially pronounced gains for the latter, reducing the multilingual performance gap on zero-shot phonetic discrimination from 31.5% for audio-only models to 8.04% with grounding.
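The headline numbers (31.5% vs. 8.04%) can be read as a relative performance gap between the bilingual and monolingual models on the same phonetic-discrimination metric. A minimal sketch of that computation, assuming the gap is defined as the relative increase in error of the multilingual model over its monolingual counterpart (the paper's exact formula is not given here, and the error rates below are hypothetical):

```python
def multilingual_gap(mono_error: float, multi_error: float) -> float:
    """Relative gap (%) of a multilingual model's error over the
    monolingual baseline, e.g. on a zero-shot ABX phoneme-discrimination
    task. This relative-error definition is an assumption for
    illustration, not necessarily the paper's exact metric."""
    return (multi_error - mono_error) / mono_error * 100.0

# Hypothetical error rates (illustrative only, not from the paper):
# a monolingual model at 5.0% error and a bilingual model at 6.575%
# error would correspond to a 31.5% relative gap.
print(multilingual_gap(5.0, 6.575))
```

Under this reading, visual grounding shrinking the gap from 31.5% to 8.04% means the bilingual model's error moves much closer to the monolingual baseline, rather than the absolute error dropping by that amount.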