Automatic Identification of Samples in Hip-Hop Music via Multi-Loss Training and an Artificial Dataset

📅 2025-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automatic identification and precise localization of short (a few seconds), pitch- and time-stretched real-world audio samples in commercial hip-hop music remain challenging due to sparse ground-truth annotations and robustness requirements. Method: This paper proposes a multi-loss CNN framework integrating classification and metric learning. It introduces a synthetically generated dataset comprising isolated vocal, harmonic, and percussive sources, augmented via audio source separation–driven data synthesis to alleviate annotation scarcity. The framework employs multi-scale feature extraction and robust audio fingerprint modeling. Results: On real hip-hop tracks, the method achieves a 13% improvement in sample identification accuracy over conventional acoustic landmark–based approaches. It robustly handles concurrent pitch shifting and time stretching, with localization errors within ±5 seconds for 50% of test tracks.
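The summary describes a joint objective combining a classification loss with a metric-learning loss. As an illustration only (the paper's actual loss weighting and architecture are not given here), a minimal numpy sketch of such a combined objective might pair softmax cross-entropy with a triplet margin term, where the "positive" is a pitch- or time-stretched version of the anchor source and the "negative" is unrelated audio; the `alpha` weight and `margin` value are hypothetical:

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single training example."""
    shifted = logits - logits.max()  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Metric-learning term: pull the transformed sample (positive)
    toward its source (anchor); push unrelated audio (negative) away."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def joint_loss(logits, label, anchor, positive, negative, alpha=1.0):
    """Weighted sum of the classification and metric-learning losses."""
    return cross_entropy(logits, label) + alpha * triplet_loss(
        anchor, positive, negative
    )
```

Training on both terms at once encourages embeddings that are discriminative across source classes while remaining invariant to the pitch- and time-stretching transformations applied during data synthesis.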

📝 Abstract
Sampling, the practice of reusing recorded music or sounds from another source in a new work, is common in popular music genres like hip-hop and rap. Numerous services have emerged that allow users to identify connections between samples and the songs that incorporate them, with the goal of enhancing music discovery. Designing a system that can perform the same task automatically is challenging, as samples are commonly altered with audio effects like pitch- and time-stretching and may only be seconds long. Progress on this task has been minimal and is further blocked by the limited availability of training data. Here, we show that a convolutional neural network trained on an artificial dataset can identify real-world samples in commercial hip-hop music. We extract vocal, harmonic, and percussive elements from several databases of non-commercial music recordings using audio source separation, and train the model to fingerprint a subset of these elements in transformed versions of the original audio. We optimize the model using a joint classification and metric learning loss and show that it achieves 13% greater precision on real-world instances of sampling than a fingerprinting system using acoustic landmarks, and that it can recognize samples that have been both pitch shifted and time stretched. We also show that, for half of the commercial music recordings we tested, our model is capable of locating the position of a sample to within five seconds.
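The abstract's localization claim (finding a sample's position to within five seconds) implies matching the sample's fingerprint against successive positions in the full track. A hedged sketch of one plausible approach, not the paper's actual procedure: slide the query's per-frame embeddings over the track's embeddings and report the offset with the highest summed cosine similarity. The `hop_seconds` parameter and embedding shapes are assumptions for illustration:

```python
import numpy as np

def locate_sample(query, track, hop_seconds=1.0):
    """Return the best-matching offset (in seconds) of `query` within `track`.

    query: (q, d) array of per-frame fingerprint embeddings for the sample
    track: (t, d) array of per-frame embeddings for the full song, t >= q
    """
    q = len(query)
    qn = query / np.linalg.norm(query, axis=1, keepdims=True)
    best_offset, best_score = 0, -np.inf
    for start in range(len(track) - q + 1):
        window = track[start:start + q]
        wn = window / np.linalg.norm(window, axis=1, keepdims=True)
        score = float((qn * wn).sum())  # summed per-frame cosine similarity
        if score > best_score:
            best_offset, best_score = start, score
    return best_offset * hop_seconds, best_score
```

With frame embeddings hopped at one second, picking the top-scoring window directly yields a localization estimate at one-second resolution, comfortably inside the ±5-second tolerance reported for half of the tested tracks.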
Problem

Research questions and friction points this paper is trying to address.

Automatic sample identification in hip-hop music
Overcoming audio effect alterations in samples
Addressing limited training data for sampling detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-loss training for improved accuracy
Artificial dataset for sample identification
Convolutional neural network for audio analysis