SLAP: Siamese Language-Audio Pretraining Without Negative Samples for Music Understanding

📅 2025-06-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing joint embedding approaches suffer from two key limitations: high memory overhead—due to reliance on large-batch negative sampling—and inconsistent cross-modal manifolds—termed the “modality gap.” This paper introduces SLAP, the first negative-sample-free framework that adapts the BYOL paradigm to music–text multimodal pretraining. SLAP employs a Siamese architecture with momentum encoders and gradient accumulation to achieve robust cross-modal representation alignment. It substantially narrows the modality gap while enabling large-scale training on a single GPU, thereby improving scalability and training stability. Experiments demonstrate that SLAP outperforms CLAP on text–music retrieval and zero-shot classification, matches or exceeds larger-scale or supervised models across multiple MIR benchmarks, and exhibits superior robustness to batch-size variations.
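The BYOL-style objective the summary describes can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the names (`byol_loss`, `ema_update`, `slap_style_loss`) and the NumPy formulation are assumptions; in a real training loop the target branch would receive no gradient (stop-gradient) and the encoders would be neural networks.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Project embeddings onto the unit sphere.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def byol_loss(online_pred, target_proj):
    # Negative-sample-free alignment loss: squared distance between
    # L2-normalized online predictions and target projections,
    # equal to 2 - 2 * cosine_similarity. No in-batch negatives.
    p = l2_normalize(online_pred)
    z = l2_normalize(target_proj)
    return float(np.mean(np.sum((p - z) ** 2, axis=-1)))

def ema_update(target_params, online_params, tau=0.99):
    # Momentum (EMA) update of the target encoder's weights.
    return [tau * t + (1.0 - tau) * o
            for t, o in zip(target_params, online_params)]

def slap_style_loss(text_pred, audio_tgt, audio_pred, text_tgt):
    # Symmetric cross-modal variant (hypothetical): the text branch
    # predicts the audio target and vice versa.
    return 0.5 * (byol_loss(text_pred, audio_tgt)
                  + byol_loss(audio_pred, text_tgt))
```

Because each pair contributes its own loss term with no coupling to other samples in the batch, the objective is insensitive to batch size, which is the property behind the reported robustness and single-GPU scalability.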

📝 Abstract
Joint embedding spaces have significantly advanced music understanding and generation by linking text and audio through multimodal contrastive learning. However, these approaches face large memory requirement limitations due to relying on large batch sizes to effectively utilize negative samples. Further, multimodal joint embedding spaces suffer from a modality gap wherein embeddings from different modalities lie in different manifolds of the embedding space. To address these challenges, we propose Siamese Language-Audio Pretraining (SLAP), a novel multimodal pretraining framework that allows learning powerful representations without negative samples. SLAP adapts the Bootstrap Your Own Latent (BYOL) paradigm for multimodal audio-text training, promoting scalability in training multimodal embedding spaces. We illustrate the ability of our model to learn meaningful relationships between music and text -- specifically, we show that SLAP outperforms CLAP on tasks such as text-music retrieval and zero-shot classification. We also observe competitive downstream performance on several MIR tasks, including with larger or supervised models (genre and instrument classification, auto-tagging). Additionally, our approach has attractive properties, such as a quantifiably reduced modality gap and improved robustness to batch size variations on retrieval performance. Finally, its novel formulation unlocks large-scale training on a single GPU through gradient accumulation.
Problem

Research questions and friction points this paper is trying to address.

High memory cost of contrastive learning, which relies on large batches of negative samples
Modality gap: text and audio embeddings lie on different manifolds of the joint space
Difficulty of scaling music-text pretraining to a single GPU
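The single-GPU point rests on gradient accumulation being exact for a negative-sample-free loss: because the objective is a per-sample mean with no cross-sample (negative) terms, averaging micro-batch gradients reproduces the full-batch gradient. A minimal sketch under stated assumptions (plain NumPy, a stand-in linear model and MSE loss rather than the actual encoders and objective):

```python
import numpy as np

def mse_grad(w, x, y):
    # Gradient of the mean squared error of a linear model x @ w.
    return 2.0 * x.T @ (x @ w - y) / len(x)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))
w = rng.normal(size=(3, 1))

# One full-batch gradient...
full_grad = mse_grad(w, x, y)

# ...equals the average of micro-batch gradients, because the loss
# decomposes per sample. A contrastive loss over in-batch negatives
# would not decompose this way, so accumulation would change it.
micro = [mse_grad(w, x[i:i + 2], y[i:i + 2]) for i in range(0, 8, 2)]
acc_grad = sum(micro) / len(micro)
```

This is why accumulation lets a single GPU emulate an arbitrarily large effective batch here, whereas contrastive methods like CLAP lose negatives when the per-step batch shrinks.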
Innovation

Methods, ideas, or system contributions that make the work stand out.

Siamese pretraining without negative samples
BYOL paradigm for multimodal training
Scalable single-GPU large-scale training