Masked Latent Prediction and Classification for Self-Supervised Audio Representation Learning

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses self-supervised representation learning for audio, aiming to make latent-space representations better suited to downstream classification. We propose a dual pretext-task framework that jointly optimizes masked latent prediction (to preserve structural fidelity) and teacher-student distribution matching for unsupervised classification (to improve discriminability), the first introduction of latent-space probabilistic alignment into audio self-supervised learning. Our method combines knowledge distillation with momentum-updated teacher networks to stabilize distribution estimation. Evaluated on benchmarks including OpenMIC and GTZAN, our approach achieves state-of-the-art self-supervised performance. Notably, it surpasses comparable supervised baselines on the MagnaTagATune music tagging task, demonstrating the superior generalization and discriminative capability of the learned representations.
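The two pretext tasks described above can be sketched as a single joint objective: a masked latent prediction loss plus a teacher-student distribution-matching loss, with the teacher updated as an exponential moving average (EMA) of the student. The function names, the MSE regression target, the temperatures `tau_s`/`tau_t`, and the weighting `lam` below are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def ema_update(teacher, student, momentum=0.99):
    # Momentum (EMA) update of the teacher from the student weights,
    # which stabilizes the teacher's distribution estimates.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

def joint_loss(student_pred, teacher_target, student_logits, teacher_logits,
               lam=1.0, tau_s=0.1, tau_t=0.05):
    # Pretext task 1: masked latent prediction -- regress the teacher's
    # latent targets at masked positions (MSE used here as a stand-in).
    pred_loss = F.mse_loss(student_pred, teacher_target.detach())
    # Pretext task 2: unsupervised classification -- cross-entropy between
    # the (sharper, lower-temperature) teacher distribution and the student's.
    t = F.softmax(teacher_logits.detach() / tau_t, dim=-1)
    s = F.log_softmax(student_logits / tau_s, dim=-1)
    cls_loss = -(t * s).sum(dim=-1).mean()
    return pred_loss + lam * cls_loss
```

In a training loop, one would call `joint_loss` on each batch, backpropagate through the student only (teacher targets are detached), then call `ema_update`.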

📝 Abstract
Recently, self-supervised learning methods based on masked latent prediction have proven to encode input data into powerful representations. However, during training, the learned latent space can be further transformed to extract higher-level information that could be more suited for downstream classification tasks. Therefore, we propose a new method: MAsked latenT Prediction And Classification (MATPAC), which is trained with two pretext tasks solved jointly. As in previous work, the first pretext task is a masked latent prediction task, ensuring a robust input representation in the latent space. The second one is unsupervised classification, which utilises the latent representations of the first pretext task to match probability distributions between a teacher and a student. We validate the MATPAC method by comparing it to other state-of-the-art proposals and conducting ablation studies. MATPAC reaches state-of-the-art self-supervised learning results on reference audio classification datasets such as OpenMIC, GTZAN, ESC-50 and US8K, and outperforms comparable supervised methods for musical auto-tagging on Magna-tag-a-tune.
Problem

Research questions and friction points this paper is trying to address.

Improves self-supervised audio representation learning
Enhances latent space for classification tasks
Proposes MATPAC for joint pretext tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked latent prediction
Unsupervised classification
Joint pretext tasks
Authors
Aurian Quelennec
LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France
Pierre Chouteau
LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France
Geoffroy Peeters
Télécom Paris (previously IRCAM - STMS)
audio signal processing, machine learning, music information retrieval
Slim Essid
NVIDIA
Machine Learning, AI, Multimodal Language Models, MIR, Audio and Speech Processing