SSLAM: Enhancing Self-Supervised Models with Audio Mixtures for Polyphonic Soundscapes

📅 2025-06-13

🏛️ International Conference on Learning Representations

📈 Citations: 0

✨ Influential: 0

career value

207K/year

🤖 AI Summary

Existing self-supervised audio models are predominantly pre-trained and evaluated on monophonic (single-source) data, limiting their generalizability and robustness to real-world polyphonic (multi-source overlapping) audio. To address this, we propose Self-Supervised Learning for Audio Mixtures (SSLAM), the first unified framework jointly modeling monophonic and polyphonic audio. SSLAM integrates contrastive learning with masked reconstruction, augmented by a novel audio mixture synthesis strategy and a hierarchical feature disentanglement mechanism. Crucially, it preserves monophonic performance while substantially enhancing polyphonic robustness. On AudioSet-2M, SSLAM achieves 50.2 mAP (+3.9% absolute gain); on polyphonic benchmarks, it sets new state-of-the-art results under both linear evaluation and fine-tuning protocols, with up to +9.1% mAP improvement.

Technology Category

Application Category

📝 Abstract

Self-supervised pre-trained audio networks have seen widespread adoption in real-world systems, particularly in multi-modal large language models. These networks are often employed in a frozen state, under the assumption that the SSL pre-training has sufficiently equipped them to handle real-world audio. However, a critical question remains: how well do these models actually perform in real-world conditions, where audio is typically polyphonic and complex, involving multiple overlapping sound sources? Current audio SSL methods are often benchmarked on datasets predominantly featuring monophonic audio, such as environmental sounds, and speech. As a result, the ability of SSL models to generalize to polyphonic audio, a common characteristic in natural scenarios, remains underexplored. This limitation raises concerns about the practical robustness of SSL models in more realistic audio settings. To address this gap, we introduce Self-Supervised Learning from Audio Mixtures (SSLAM), a novel direction in audio SSL research, designed to improve, designed to improve the model's ability to learn from polyphonic data while maintaining strong performance on monophonic data. We thoroughly evaluate SSLAM on standard audio SSL benchmark datasets which are predominantly monophonic and conduct a comprehensive comparative analysis against SOTA methods using a range of high-quality, publicly available polyphonic datasets. SSLAM not only improves model performance on polyphonic audio, but also maintains or exceeds performance on standard audio SSL benchmarks. Notably, it achieves up to a 3.9% improvement on the AudioSet-2M (AS-2M), reaching a mean average precision (mAP) of 50.2. For polyphonic datasets, SSLAM sets new SOTA in both linear evaluation and fine-tuning regimes with performance improvements of up to 9.1% (mAP).

Problem

Research questions and friction points this paper is trying to address.

Evaluating SSL models in real-world polyphonic audio conditions

Addressing generalization gaps in SSL for complex soundscapes

Enhancing model robustness for overlapping sound sources

Innovation

Methods, ideas, or system contributions that make the work stand out.

Enhances SSL models with polyphonic audio mixtures

Improves performance on polyphonic and monophonic datasets

Achieves SOTA results in linear and fine-tuning evaluations

🔎 Similar Papers

Compositional Audio Representation Learning