Double Entendre: Robust Audio-Based AI-Generated Lyrics Detection via Multi-View Fusion

📅 2025-06-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI-generated lyric detection methods suffer from two key limitations: audio-based detectors generalize poorly and lack robustness, while lyrics-based methods rely heavily on high-quality human-annotated lyrics, which are often unavailable in real-world scenarios. This paper proposes DE-detect, a multi-view late-fusion detection framework that operates solely on raw audio, without requiring manual lyric annotations. Its core innovation is the "audio-intrinsic lyric perception" paradigm, which jointly models speech and transcribed text through automatic speech recognition (ASR)-based transcription, phoneme-level speech representation learning, and cross-modal feature alignment. The modular architecture significantly enhances generalizability across diverse generative models and robustness against common audio distortions, including MP3 compression and resampling. Evaluated on a multi-source AI music dataset, DE-detect achieves superior accuracy over state-of-the-art methods and maintains >92% detection accuracy under adversarial perturbations. The code is publicly available.

📝 Abstract
The rapid advancement of AI-based music generation tools is revolutionizing the music industry but also posing challenges to artists, copyright holders, and providers alike. This necessitates reliable methods for detecting such AI-generated content. However, existing detectors, relying on either audio or lyrics, face key practical limitations: audio-based detectors fail to generalize to new or unseen generators and are vulnerable to audio perturbations; lyrics-based methods require cleanly formatted and accurate lyrics, unavailable in practice. To overcome these limitations, we propose a novel, practically grounded approach: a multimodal, modular late-fusion pipeline that combines automatically transcribed sung lyrics and speech features capturing lyrics-related information within the audio. By relying on lyrical aspects directly from audio, our method enhances robustness, mitigates susceptibility to low-level artifacts, and enables practical applicability. Experiments show that our method, DE-detect, outperforms existing lyrics-based detectors while also being more robust to audio perturbations. Thus, it offers an effective, robust solution for detecting AI-generated music in real-world scenarios. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.
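The late-fusion pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the per-view scores are assumed to come from hypothetical upstream models (an audio-feature classifier and a classifier over ASR-transcribed lyrics), and the fusion rule shown here is a simple weighted average of per-view probabilities.

```python
# Minimal late-fusion sketch. Assumptions (not from the paper):
# - audio_score and lyrics_score are probabilities in [0, 1] that the
#   track is AI-generated, produced by two hypothetical upstream models.
# - Fusion is a weighted average; the paper's exact fusion rule may differ.

def late_fusion(audio_score: float, lyrics_score: float,
                w_audio: float = 0.5) -> float:
    """Fuse the two per-view probabilities by weighted averaging."""
    return w_audio * audio_score + (1.0 - w_audio) * lyrics_score


def is_ai_generated(audio_score: float, lyrics_score: float,
                    threshold: float = 0.5) -> bool:
    """Flag the track when the fused score crosses the threshold."""
    return late_fusion(audio_score, lyrics_score) >= threshold


# Example: the audio view is unsure (0.55) but the transcribed-lyrics
# view is confident (0.90); the fused score 0.725 flags the track.
print(is_ai_generated(0.55, 0.90))  # True
```

Because fusion happens at the score level rather than the feature level, either view can be retrained or swapped out independently, which is one way a modular late-fusion design supports generalization to new generators.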
Problem

Research questions and friction points this paper is trying to address.

Detect AI-generated music robustly via multimodal fusion
Overcome limitations of audio-only or lyrics-only detectors
Enhance robustness against audio perturbations and low-level artifacts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal late-fusion pipeline combining audio and lyrics
Uses automatically transcribed sung lyrics
Robust to audio perturbations and low-level artifacts