🤖 AI Summary
This work addresses the problem of automatically identifying musical samples and tracing the material they originate from. The proposed method is a self-supervised contrastive learning framework designed specifically for this task. Its core innovation is generating artificial remix positive pairs from high-quality separated stems obtained via state-of-the-art source separation, enabling a contrastive objective grounded in cross-version audio matching that captures sample variability across diverse mixing conditions and musical genres. The pipeline combines audio signal processing, stem separation, and large-scale retrieval against a reference database, and requires no manual annotations. Experiments demonstrate strong generalization across genres and significant improvements over prior state-of-the-art baselines, and the method remains stable and scales well as the number of noise songs in the reference database increases.
📝 Abstract
Sampling, the practice of reusing pieces of existing audio tracks to create new musical content, is very common in modern music production. In this paper, we tackle the challenging task of automatic sample identification: detecting such sampled content and retrieving the material from which it originates. To do so, we adopt a self-supervised learning approach that leverages a multi-track dataset to create positive pairs of artificial mixes, and we design a novel contrastive learning objective. We show that this method significantly outperforms previous state-of-the-art baselines, is robust across various genres, and scales well as the number of noise songs in the reference database increases. In addition, we extensively analyze the contribution of the different components of our training pipeline and highlight, in particular, the need for high-quality separated stems for this task.
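The contrastive objective described above can be illustrated with an InfoNCE-style loss, a common choice for such self-supervised frameworks. The sketch below is an assumption about the general shape of the objective, not the paper's exact formulation: row `i` of each batch holds the embedding of one artificial mix, and the matching row of the other batch holds its positive pair (a different mix sharing the same sampled stem), with all other rows acting as in-batch negatives.

```python
import numpy as np

def info_nce_loss(query_emb, ref_emb, temperature=0.1):
    """InfoNCE-style contrastive loss (illustrative sketch).

    query_emb, ref_emb: (batch, dim) arrays; row i of each is a positive
    pair (two artificial mixes containing the same stem), and every other
    row in ref_emb serves as an in-batch negative for query i.
    """
    # L2-normalise so the dot product is a cosine similarity
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    logits = q @ r.T / temperature  # (batch, batch) similarity matrix
    # Softmax cross-entropy with the diagonal (true pairs) as the target
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

# Toy usage with random embeddings: matched pairs give a lower loss
# than unrelated ones.
rng = np.random.default_rng(0)
mix_a = rng.normal(size=(8, 16))
mix_b = mix_a + 0.05 * rng.normal(size=(8, 16))  # slightly perturbed positives
loss_matched = info_nce_loss(mix_a, mix_b)
loss_random = info_nce_loss(mix_a, rng.normal(size=(8, 16)))
```

In the paper's setting, the embeddings would come from an audio encoder applied to the artificial remixes; the temperature and batch construction here are placeholder choices.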