🤖 AI Summary
This paper addresses zero-shot musical stem retrieval: given a mixture, retrieving a stem that fits it musically, i.e., that would sound pleasant if played together with it, including for instrument classes unseen during training. The proposed method is based on a Joint-Embedding Predictive Architecture (JEPA), in which an encoder and a predictor are jointly trained to produce latent representations of a context and predict latent representations of a target; the predictor is conditioned on arbitrary instrument classes, which is what enables zero-shot retrieval. Pretraining the audio encoder with contrastive learning is found to drastically improve retrieval performance. The approach significantly outperforms previous baselines on the MUSDB18 and MoisesDB datasets, and an evaluation on a beat tracking task shows that the learned embeddings retain temporal structure and local information.
📝 Abstract
In this paper, we tackle the task of musical stem retrieval: given a musical mix, the goal is to retrieve a stem that fits with it, i.e., that would sound pleasant if played together with it. To do so, we introduce a new method based on Joint-Embedding Predictive Architectures, where an encoder and a predictor are jointly trained to produce latent representations of a context and predict latent representations of a target. In particular, we design our predictor to be conditioned on arbitrary instruments, enabling our model to perform zero-shot stem retrieval. In addition, we find that pretraining the encoder using contrastive learning drastically improves the model's performance. We validate the retrieval performance of our model on the MUSDB18 and MoisesDB datasets, showing that it significantly outperforms previous baselines on both and that it supports conditioning at varying levels of precision, including on instruments unseen during training. We also evaluate the learned embeddings on a beat tracking task, demonstrating that they retain temporal structure and local information.
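The retrieval pipeline the abstract describes (encode the mix into a latent, use an instrument-conditioned predictor to form a query embedding, then rank candidate stems by similarity to that query) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the encoder and predictor here are fixed random linear maps standing in for trained networks, and all names, dimensions, and the concatenation-based conditioning scheme are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 32      # hypothetical input feature dimension
EMB_DIM = 16       # hypothetical latent dimension
N_INSTRUMENTS = 4  # hypothetical instrument vocabulary size

# Toy "encoder": a fixed linear projection standing in for the trained audio encoder.
W_enc = rng.standard_normal((FEAT_DIM, EMB_DIM)) / np.sqrt(FEAT_DIM)

# Toy "predictor": maps [context embedding ; instrument embedding] to a
# predicted target-stem embedding, i.e. the conditioning signal is an
# instrument class embedding looked up from a table.
instrument_table = rng.standard_normal((N_INSTRUMENTS, EMB_DIM))
W_pred = rng.standard_normal((2 * EMB_DIM, EMB_DIM)) / np.sqrt(2 * EMB_DIM)

def encode(features):
    """Project input features into the shared latent space."""
    return features @ W_enc

def predict(context_emb, instrument_id):
    """Predict the latent of the missing stem for a given instrument."""
    cond = np.concatenate([context_emb, instrument_table[instrument_id]])
    return cond @ W_pred

def retrieve(mix_features, candidate_features, instrument_id):
    """Rank candidate stems by cosine similarity to the predicted latent."""
    query = predict(encode(mix_features), instrument_id)
    cands = encode(candidate_features)
    sims = cands @ query / (
        np.linalg.norm(cands, axis=1) * np.linalg.norm(query) + 1e-9
    )
    return int(np.argmax(sims))

mix = rng.standard_normal(FEAT_DIM)
candidates = rng.standard_normal((8, FEAT_DIM))
best = retrieve(mix, candidates, instrument_id=2)
print(best)  # index of the highest-scoring candidate stem
```

Because the conditioning is just an embedding lookup fed to the predictor, swapping in a different instrument id changes the query without retraining, which is the mechanism behind the zero-shot behavior described above.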