Supervised contrastive learning from weakly-labeled audio segments for musical version matching

📅 2025-02-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing musical version matching methods operate predominantly at the full-track level, so they do not support the matching of short audio segments (e.g., 20 seconds) that most real-world applications require. They also rely on classification or triplet losses and neglect more recent loss functions. To address this, the paper proposes a method that learns from weakly labeled audio segments, based on pairwise segment distance reductions, together with a contrastive loss variant that modifies an existing loss following decoupling, hyperparameter, and geometric considerations. Combined, the two elements achieve state-of-the-art results in the standard track-level evaluation and a large improvement in segment-level evaluation. The authors suggest that, because the challenges addressed are general, the methods may also be useful in domains beyond audio or musical version matching.

📝 Abstract

Detecting musical versions (different renditions of the same piece) is a challenging task with important applications. Because of the nature of the available ground truth, existing approaches match musical versions at the track level (e.g., whole song). However, most applications require matching at the segment level (e.g., 20 s chunks). In addition, existing approaches resort to classification and triplet losses, disregarding more recent losses that could bring meaningful improvements. In this paper, we propose a method to learn from weakly annotated segments, together with a contrastive loss variant that outperforms well-studied alternatives. The former is based on pairwise segment distance reductions, while the latter modifies an existing loss following decoupling, hyper-parameter, and geometric considerations. With these two elements, we not only achieve state-of-the-art results in the standard track-level evaluation but also obtain a breakthrough performance in a segment-level evaluation. We believe that, due to the generality of the challenges addressed here, the proposed methods may find utility in domains beyond audio or musical version matching.
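
To make "pairwise segment distance reductions" concrete, here is a minimal sketch of one way such a reduction could work. It is an illustration under stated assumptions, not the paper's implementation: the cosine distance, the `min` and `mean` reductions, and all function names are hypothetical choices, since the abstract does not specify them. The idea is that labels exist only per track, so every segment pair between two tracks is compared and the resulting distance matrix is reduced to a single value that a track-level label can supervise.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors (0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_segment_distances(track_a, track_b):
    """Distance matrix between every segment embedding of two tracks."""
    return [[cosine_distance(u, v) for v in track_b] for u in track_a]


def reduce_distances(dist_matrix, reduction="min"):
    """Collapse a segment-by-segment distance matrix into one track-pair value.

    The reduction choices here are assumptions made for illustration only.
    """
    flat = [d for row in dist_matrix for d in row]
    if reduction == "min":
        return min(flat)  # best-matching segment pair
    if reduction == "mean":
        return sum(flat) / len(flat)  # average over all segment pairs
    raise ValueError(f"unknown reduction: {reduction}")


# Toy example: two tracks with two 2-D segment embeddings each.
track_a = [[1.0, 0.0], [0.0, 1.0]]
track_b = [[1.0, 0.0], [-1.0, 0.0]]
distances = pairwise_segment_distances(track_a, track_b)
best_pair = reduce_distances(distances, "min")
average = reduce_distances(distances, "mean")
```

With a `min` reduction, two tracks count as close when at least one pair of segments matches, which fits the weak-label setting where not every segment of one version needs to correspond to a segment of the other.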

Problem

Research questions and friction points this paper is trying to address.

Improving segment-level musical version matching

Utilizing weakly-labeled audio segments

Enhancing performance with a novel contrastive loss

Innovation

Methods, ideas, or system contributions that make the work stand out.

Weakly-labeled audio segments

Contrastive loss variant

Segment distance reductions
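
For readers unfamiliar with supervised contrastive losses, the sketch below shows the commonly used generic formulation, in which each anchor is pulled toward all other samples sharing its label and pushed away from the rest. It is not the paper's loss variant, whose decoupling, hyperparameter, and geometric modifications are not described on this page. The function names, the temperature value, and the use of a shared version label as the class label are assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss, averaged over anchors with positives."""
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits.values())
        )
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Toy example: four segment embeddings, labeled by the version they belong to.
segments = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
loss_consistent = supcon_loss(segments, [0, 0, 1, 1])  # labels agree with geometry
loss_conflicting = supcon_loss(segments, [0, 1, 0, 1])  # labels cut across geometry
```

The loss is small when segments with the same label already sit close together and large when the labels contradict the embedding geometry, which is the signal used to train the embedding model.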

🔎 Similar Papers

COCOLA: Coherence-Oriented Contrastive Learning of Musical Audio Representations

2024-04-25 · arXiv.org · Citations: 2
