Enhancing Neural Audio Fingerprint Robustness to Audio Degradation for Music Identification

📅 2025-06-27
🤖 AI Summary
Neural audio fingerprints are often not robust enough to realistic audio degradations such as reverberation, noise, and compression. This paper proposes a set of best practices for self-supervised fingerprint learning for music identification. It improves the self-supervision by exploiting musical signal properties and by simulating realistic room acoustics when generating degraded training examples. It also presents the first systematic evaluation of metric learning losses for audio fingerprinting, showing that a self-supervised adaptation of the triplet loss outperforms the alternatives, including the commonly used NT-Xent loss, and that training with multiple positive samples per anchor affects different loss functions in critically different ways. The resulting approach achieves state-of-the-art performance on a large, synthetically degraded dataset and on a real-world dataset recorded with microphones in diverse music venues.

📝 Abstract
Audio fingerprinting (AFP) allows the identification of unknown audio content by extracting compact representations, termed audio fingerprints, that are designed to remain robust against common audio degradations. Neural AFP methods often employ metric learning, where representation quality is influenced by the nature of the supervision and the utilized loss function. However, recent work unrealistically simulates real-life audio degradation during training, resulting in sub-optimal supervision. Additionally, although several modern metric learning approaches have been proposed, current neural AFP methods continue to rely on the NT-Xent loss without exploring the recent advances or classical alternatives. In this work, we propose a series of best practices to enhance the self-supervision by leveraging musical signal properties and realistic room acoustics. We then present the first systematic evaluation of various metric learning approaches in the context of AFP, demonstrating that a self-supervised adaptation of the triplet loss yields superior performance. Our results also reveal that training with multiple positive samples per anchor has critically different effects across loss functions. Our approach is built upon these insights and achieves state-of-the-art performance on both a large, synthetically degraded dataset and a real-world dataset recorded using microphones in diverse music venues.
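
For reference, the NT-Xent loss that the abstract names as the current default in neural AFP can be sketched as below. This is a minimal NumPy illustration and not the paper's implementation. The function name `nt_xent_loss`, the temperature value, and the batch layout (row i of `z_a` and row i of `z_b` are two views of the same segment) are assumptions made for the example.

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.1):
    """NT-Xent over a batch of paired embeddings (hypothetical helper).

    z_a[i] and z_b[i] are embeddings of two views of the same audio
    segment; every other row in the batch acts as a negative.
    Both inputs have shape (N, D).
    """
    z = np.concatenate([z_a, z_b], axis=0)              # (2N, D)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # L2-normalise rows
    sim = z @ z.T / temperature                         # scaled cosine similarity
    np.fill_diagonal(sim, -np.inf)                      # exclude self-similarity
    n = z_a.shape[0]
    # Index of the positive for each row: row i pairs with row i + N.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Row-wise log-softmax, stabilised by subtracting the row maximum.
    shifted = sim - sim.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

The loss is small when each embedding is closest to its own second view and large when another item in the batch is closer than the true positive.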

Problem

Research questions and friction points this paper is trying to address.

Improving neural audio fingerprint robustness against degradation
Exploring better metric learning approaches for audio fingerprinting
Enhancing self-supervision with realistic acoustics and music properties
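
The kind of degradation chain that the last item refers to can be illustrated as reverberation followed by additive noise. The sketch below is a toy example under stated assumptions: it uses a synthetic, exponentially decaying impulse response in place of a measured or physically simulated one, and white noise in place of recorded background noise. The function names and parameter values are hypothetical and are not taken from the paper.

```python
import numpy as np

def synthetic_rir(sample_rate, rt60=0.5, rng=None):
    """Toy room impulse response: decaying white noise (assumption).

    The envelope falls by 60 dB after rt60 seconds.
    """
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(int(rt60 * sample_rate)) / sample_rate
    decay = 10.0 ** (-3.0 * t / rt60)    # amplitude 1e-3 (-60 dB) at t = rt60
    rir = rng.standard_normal(t.size) * decay
    rir[0] = 1.0                          # direct path
    return rir / np.linalg.norm(rir)

def degrade(clean, rir, snr_db, rng=None):
    """Apply reverberation, then add white noise at the requested SNR."""
    rng = np.random.default_rng() if rng is None else rng
    # Convolve with the impulse response and trim to the input length.
    reverberant = np.convolve(clean, rir)[: clean.size]
    noise = rng.standard_normal(clean.size)
    signal_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that signal_power / scaled_noise_power matches snr_db.
    gain = np.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return reverberant + gain * noise
```

In a self-supervised setup, a clean segment and its degraded copy would form an anchor-positive pair, so the realism of this chain directly shapes the supervision signal.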

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging musical signal properties for self-supervision
Systematic evaluation of metric learning approaches
Self-supervised adaptation of triplet loss
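
This summary does not give the exact form of the paper's triplet loss, so the following is only one plausible reading: a margin-based triplet loss in which each anchor has several positives (degraded views of the same segment) and the views of all other items in the batch serve as negatives. The function name, the cosine distance, the margin value, and the choice of in-batch negatives are assumptions for illustration.

```python
import numpy as np

def multi_positive_triplet_loss(anchors, positives, margin=0.5):
    """Triplet margin loss with several positives per anchor (hypothetical).

    anchors:   (N, D) embeddings of clean segments, N >= 2.
    positives: (N, P, D) embeddings of P degraded views of each segment.
    Negatives for anchor i are the views of every other item j != i.
    """
    a = anchors / np.linalg.norm(anchors, axis=-1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=-1, keepdims=True)
    n, num_pos, _ = p.shape
    # dist[i, j, k]: cosine distance between anchor i and view k of item j.
    dist = 1.0 - np.einsum("id,jkd->ijk", a, p)
    same = np.eye(n, dtype=bool)
    d_pos = dist[same]                                   # (N, P)
    d_neg = dist[~same].reshape(n, (n - 1) * num_pos)    # (N, (N-1)*P)
    # Hinge over every (positive, negative) combination of each anchor.
    hinge = np.maximum(0.0, d_pos[:, :, None] - d_neg[:, None, :] + margin)
    return hinge.mean()
```

The loss reaches zero once every positive is closer to its anchor than every negative by at least the margin, which is a different stopping behaviour from NT-Xent's softmax-based objective.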