📝 Abstract
The proliferation of distorted, compressed, and manipulated music on modern media platforms like TikTok motivates the development of more robust audio fingerprinting techniques to identify the sources of musical recordings. In this paper, we develop and evaluate new neural audio fingerprinting techniques with the aim of improving their robustness. We make two contributions to neural fingerprinting methodology: (1) we use a pretrained music foundation model as the backbone of the neural architecture and (2) we expand the use of data augmentation to train fingerprinting models under a wide variety of audio manipulations, including time stretching, pitch modulation, compression, and filtering. We systematically evaluate our methods in comparison to two state-of-the-art neural fingerprinting models: NAFP and GraFPrint. Results show that fingerprints extracted with music foundation models (e.g., MuQ, MERT) consistently outperform models trained from scratch or pretrained on non-musical audio. Segment-level evaluation further reveals their capability to accurately localize fingerprint matches, an important practical feature for catalog management.
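To make the augmentation idea concrete, below is a minimal sketch of a degradation-style augmentation chain of the kind the abstract describes. This is not the paper's actual pipeline: the function names, parameter ranges, and the specific degradations (resampling-based speed change as a crude stand-in for time stretching and pitch shifting, a Butterworth low-pass for band-limited filtering, and hard clipping as a toy distortion model) are all illustrative assumptions, using only NumPy and SciPy.

```python
import numpy as np
from scipy import signal

SR = 16000  # assumed sample rate for this toy example


def speed_perturb(x: np.ndarray, rate: float) -> np.ndarray:
    """Resample-style speed change via linear interpolation.

    Note: this alters tempo and pitch together; it is a crude stand-in
    for the separate time-stretch / pitch-shift augmentations.
    """
    n_out = int(len(x) / rate)
    idx = np.linspace(0, len(x) - 1, n_out)
    return np.interp(idx, np.arange(len(x)), x)


def band_limit(x: np.ndarray, cutoff_hz: float, sr: int = SR) -> np.ndarray:
    """Low-pass filter simulating band-limited or lossy playback."""
    b, a = signal.butter(4, cutoff_hz / (sr / 2), btype="low")
    return signal.lfilter(b, a, x)


def clip_distort(x: np.ndarray, threshold: float) -> np.ndarray:
    """Hard clipping as a crude distortion model."""
    return np.clip(x, -threshold, threshold)


def augment(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Chain randomly parameterized degradations into one training view."""
    y = speed_perturb(x, rate=rng.uniform(0.9, 1.1))
    y = band_limit(y, cutoff_hz=rng.uniform(3000.0, 7000.0))
    y = clip_distort(y, threshold=rng.uniform(0.4, 0.9))
    return y


# Demo: degrade one second of a 440 Hz tone.
rng = np.random.default_rng(0)
t = np.arange(SR) / SR
x = 0.8 * np.sin(2 * np.pi * 440.0 * t)
y = augment(x, rng)
```

In training, each audio clip would be paired with one or more such degraded views so the fingerprinter learns representations invariant to these manipulations.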