🤖 AI Summary
This paper addresses zero-shot musical stem retrieval: given a mixture, retrieving a stem that fits it musically, i.e., that would sound pleasant if played together with it, including for instrument classes unseen during training. The proposed method is based on a Joint-Embedding Predictive Architecture (JEPA), in which an encoder and a predictor are jointly trained to produce latent representations of a context and predict latent representations of a target; the predictor is conditioned on arbitrary instrument classes, which is what enables zero-shot retrieval. Pretraining the audio encoder with contrastive learning is found to drastically improve retrieval performance. The approach significantly outperforms previous baselines on the MUSDB18 and MoisesDB datasets, and an evaluation on a beat tracking task shows that the learned embeddings retain temporal structure and local information.
📝 Abstract
In this paper, we tackle the task of musical stem retrieval: given a musical mix, the goal is to retrieve a stem that fits with it, i.e., that would sound pleasant if played together with it. To do so, we introduce a new method based on Joint-Embedding Predictive Architectures, where an encoder and a predictor are jointly trained to produce latent representations of a context and predict latent representations of a target. In particular, we design our predictor to be conditioned on arbitrary instruments, enabling our model to perform zero-shot stem retrieval. In addition, we find that pretraining the encoder using contrastive learning drastically improves the model's performance. We validate the retrieval performance of our model on the MUSDB18 and MoisesDB datasets, showing that it significantly outperforms previous baselines on both and that it supports conditioning at varying levels of precision, including on instruments unseen during training. We also evaluate the learned embeddings on a beat tracking task, demonstrating that they retain temporal structure and local information.
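The retrieval pipeline the abstract describes (encode the mix into a latent, use an instrument-conditioned predictor to form a query embedding, then rank candidate stems by similarity to that query) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the encoder and predictor here are fixed random linear maps standing in for trained networks, and all names, dimensions, and the concatenation-based conditioning scheme are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 32      # hypothetical input feature dimension
EMB_DIM = 16       # hypothetical latent dimension
N_INSTRUMENTS = 4  # hypothetical instrument vocabulary size

# Toy "encoder": a fixed linear projection standing in for the trained audio encoder.
W_enc = rng.standard_normal((FEAT_DIM, EMB_DIM)) / np.sqrt(FEAT_DIM)

# Toy "predictor": maps [context embedding ; instrument embedding] to a
# predicted target-stem embedding, i.e. the conditioning signal is an
# instrument class embedding looked up from a table.
instrument_table = rng.standard_normal((N_INSTRUMENTS, EMB_DIM))
W_pred = rng.standard_normal((2 * EMB_DIM, EMB_DIM)) / np.sqrt(2 * EMB_DIM)

def encode(features):
    """Project input features into the shared latent space."""
    return features @ W_enc

def predict(context_emb, instrument_id):
    """Predict the latent of the missing stem for a given instrument."""
    cond = np.concatenate([context_emb, instrument_table[instrument_id]])
    return cond @ W_pred

def retrieve(mix_features, candidate_features, instrument_id):
    """Rank candidate stems by cosine similarity to the predicted latent."""
    query = predict(encode(mix_features), instrument_id)
    cands = encode(candidate_features)
    sims = cands @ query / (
        np.linalg.norm(cands, axis=1) * np.linalg.norm(query) + 1e-9
    )
    return int(np.argmax(sims))

mix = rng.standard_normal(FEAT_DIM)
candidates = rng.standard_normal((8, FEAT_DIM))
best = retrieve(mix, candidates, instrument_id=2)
print(best)  # index of the highest-scoring candidate stem
```

Because the conditioning is just an embedding lookup fed to the predictor, swapping in a different instrument id changes the query without retraining, which is the mechanism behind the zero-shot behavior described above.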