🤖 AI Summary
Audio-language pretraining suffers from insufficient large-scale high-quality data, limited caption diversity, and a lack of systematic evaluation, hindering progress toward general-purpose audio understanding. To address these challenges, we introduce CaptionStew, a 10.7M-caption, multi-style audio-text dataset, and conduct the first systematic comparative study of contrastive versus captioning objectives for cross-domain representation learning across speech, music, and environmental sounds. Experiments show that audio-language pretraining yields competitive, transferable representations, with complementary objective strengths: contrastive learning is more data-efficient at smaller scales, while captioning scales better on language-involved audio understanding tasks. We fully open-source our data curation pipelines, training code, and pretrained models, establishing critical infrastructure and an empirical benchmark for audio-language foundation model research.
📝 Abstract
Audio-language pretraining holds promise for general-purpose audio understanding, yet remains underexplored compared to its vision counterpart. While vision-language models like CLIP serve as widely adopted foundations, existing audio-language models primarily excel at retrieval tasks with limited adoption as general-purpose encoders. We identify three key barriers: limited large-scale audio-text corpora, insufficient caption diversity, and lack of systematic exploration and evaluation. To this end, we introduce CaptionStew, a 10.7M caption dataset aggregating diverse open-source audio-text corpora across multiple domains and captioning styles. Using this resource, we conduct the first comprehensive evaluation comparing contrastive and captioning objectives for audio representation learning across speech, music, and environmental sound tasks. Our results demonstrate that audio-language pretraining yields competitive, transferable representations. Through systematic data-scaling experiments, we reveal complementary objective strengths: contrastive learning achieves superior data efficiency at smaller scales, while captioning demonstrates better scalability on language-involved audio understanding tasks. We also find that common supervised initialization practices provide diminishing returns at scale, challenging current approaches. These findings establish audio-language pretraining as a viable pathway toward general-purpose audio representations, guiding future research. To accelerate progress, we release data preparation recipes, training protocols, and pretrained models, paving the way toward universal audio understanding.