GLAP: General contrastive audio-text pretraining across domains and languages

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing CLAP methods are limited to English sound and music retrieval and lack support for multilingual spoken content and cross-domain audio understanding. This paper introduces GLAP, a cross-lingual and cross-domain general audio–text contrastive pretraining framework that unifies representation learning for diverse audio modalities—speech, music, and sound events—with textual inputs across 50 languages. Methodologically, it adopts a dual-tower contrastive architecture pairing a multilingual text encoder with a general audio encoder, enabling joint alignment of multilingual speech content and cross-domain audio semantics. Experiments demonstrate competitive performance on the Clotho and AudioCaps retrieval benchmarks; significant gains in zero-shot sound-event classification and cross-lingual keyword spotting; and substantial improvements over baselines on sound and music understanding across four languages.
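The dual-tower contrastive setup described above can be sketched as a symmetric InfoNCE-style objective over paired audio and text embeddings. The sketch below is illustrative only: the function name, temperature value, and embedding shapes are assumptions, not details taken from the paper, and the actual encoders (multilingual text tower, audio tower) are abstracted away as precomputed embedding matrices.

```python
import numpy as np

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Illustrative CLAP-style loss: hypothetical names/shapes, not the paper's code.

    audio_emb, text_emb: [batch, dim] arrays where row i of each is a matched pair.
    """
    # L2-normalize both towers' embeddings so logits are scaled cosine similarities
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature        # [batch, batch] pairwise similarities
    labels = np.arange(len(a))            # matched pairs sit on the diagonal

    def xent(lg):
        # numerically stable log-softmax over each row, then pick the diagonal
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the audio-to-text and text-to-audio cross-entropies
    return 0.5 * (xent(logits) + xent(logits.T))
```

In-batch non-matching pairs serve as negatives, which is the standard mechanism by which such a loss aligns the two towers in a shared embedding space.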

📝 Abstract
Contrastive Language Audio Pretraining (CLAP) is a widely-used method to bridge the gap between audio and text domains. Current CLAP methods enable sound and music retrieval in English, ignoring multilingual spoken content. To address this, we introduce general language audio pretraining (GLAP), which expands CLAP with multilingual and multi-domain abilities. GLAP demonstrates its versatility by achieving competitive performance on standard audio-text retrieval benchmarks like Clotho and AudioCaps, while significantly surpassing existing methods in speech retrieval and classification tasks. Additionally, GLAP achieves strong results on widely used sound-event zero-shot benchmarks, while simultaneously outperforming previous methods on speech content benchmarks. Further keyword spotting evaluations across 50 languages emphasize GLAP's advanced multilingual capabilities. Finally, multilingual sound and music understanding is evaluated across four languages. Checkpoints and Source: https://github.com/xiaomi-research/dasheng-glap.
Problem

Research questions and friction points this paper is trying to address.

Current CLAP methods support only English sound and music retrieval, ignoring multilingual spoken content
Speech retrieval and classification lag behind sound and music tasks in existing audio-text models
Multilingual sound and music understanding is largely unevaluated beyond English
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expands CLAP into general language audio pretraining (GLAP) with multilingual, multi-domain coverage
Significantly surpasses existing methods on speech retrieval and classification
Demonstrates strong keyword spotting across 50 languages and sound–music understanding in four