AudioMosaic: Contrastive Masked Audio Representation Learning

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of insufficient augmentation strategies and high memory consumption in audio self-supervised contrastive learning by proposing AudioMosaic, a novel contrastive learning framework based on structured time–frequency masking. By applying structured masks to spectrogram patches to construct high-quality positive pairs, AudioMosaic substantially reduces training memory usage while enhancing the discriminability and cross-domain transferability of learned representations. Integrated with large-batch training and evaluated through both linear probing and fine-tuning protocols, the method achieves state-of-the-art performance across multiple standard audio benchmarks and effectively boosts the performance of audio–language models in cross-modal tasks.
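As a rough illustration of how structured time-frequency masking could produce a positive pair, the sketch below masks one contiguous band of time steps and one contiguous band of frequency bins on a grid of spectrogram patches, and does so twice for the same clip. The function names, the grid size, the band widths, and the single-band-per-axis layout are all assumptions made for illustration; the summary does not specify the exact masking scheme.

```python
import random

def structured_keep_indices(n_time, n_freq, time_span, freq_span, rng):
    """Return (t, f) indices of patches left visible after masking
    one contiguous time band and one contiguous frequency band.
    Hypothetical helper: band layout and sizes are assumptions."""
    t0 = rng.randrange(n_time - time_span + 1)   # start of masked time band
    f0 = rng.randrange(n_freq - freq_span + 1)   # start of masked frequency band
    masked_t = range(t0, t0 + time_span)
    masked_f = range(f0, f0 + freq_span)
    return [(t, f)
            for t in range(n_time)
            for f in range(n_freq)
            if t not in masked_t and f not in masked_f]

def make_positive_pair(n_time, n_freq, time_span, freq_span, rng):
    """Two independently masked views of the same clip form one positive pair."""
    view_a = structured_keep_indices(n_time, n_freq, time_span, freq_span, rng)
    view_b = structured_keep_indices(n_time, n_freq, time_span, freq_span, rng)
    return view_a, view_b

rng = random.Random(0)
view_a, view_b = make_positive_pair(8, 4, 3, 1, rng)
print(len(view_a), len(view_b), "of", 8 * 4)  # 15 15 of 32
```

In this toy setting each view keeps (8 - 3) × (4 - 1) = 15 of 32 patches. If the encoder only processes the visible patches, per-clip activation memory shrinks accordingly, which is one plausible reading of how masking frees room for larger batches.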

📝 Abstract

Audio self-supervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data. While recent advances have been driven mainly by generative reconstruction objectives, contrastive approaches remain less explored, partly due to the difficulty of designing effective audio augmentations and the large batch sizes required for contrastive pre-training. We introduce AudioMosaic, a contrastive-learning-based audio encoder for general audio understanding. During pre-training, AudioMosaic constructs positive pairs by applying structured time-frequency masking to spectrogram patches, which reduces memory usage and enables efficient large-batch training. Compared with generative approaches, the AudioMosaic encoder learns more discriminative utterance-level representations that demonstrate strong transferability across datasets, domains, and acoustic conditions. Extensive experiments show that AudioMosaic achieves state-of-the-art performance on several standard audio benchmarks under both linear probing and fine-tuning. We further show that integrating the pretrained AudioMosaic encoder into audio-language models improves performance on audio-language tasks. The code is publicly available on GitHub: https://github.com/HanxunH/AudioMosaic
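To make the link between contrastive pre-training and batch size concrete, here is a minimal sketch of an InfoNCE-style loss over utterance-level embeddings. The abstract does not name the loss AudioMosaic uses, so InfoNCE, the temperature value, the one-directional (view A to view B) form, and the function names are all assumptions chosen for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(view_a, view_b, temperature=0.1):
    """Mean InfoNCE loss. view_a[i] and view_b[i] are embeddings of two
    views of clip i; every other clip in the batch acts as a negative.
    Assumed objective: the source does not specify the loss."""
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity of the anchor to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(anchor, other)) / temperature
                  for other in b]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive
    return total / len(a)

matched = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < swapped)  # True
```

Each anchor is contrasted against every other clip in the batch, so a batch of N clips supplies N - 1 negatives per anchor. That dependence is why contrastive objectives tend to benefit from large batches, and why a lower memory cost per clip matters.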

Problem

Research questions and friction points this paper is trying to address.

audio self-supervised learning

contrastive learning

audio representation learning

data augmentation

large-batch training

Innovation

Methods, ideas, or system contributions that make the work stand out.

contrastive learning

structured masking

audio self-supervised learning