🤖 AI Summary
Audio-based AI-generated song detection suffers from poor generalizability, while lyrics-dependent methods lack practicality due to reliance on clean, aligned textual annotations.
Method: We propose a robust, ASR-driven cross-lingual detection framework: Whisper large-v2 transcribes multilingual audio into lyrics; LLM2Vec encodes the transcriptions into semantic embeddings; and ensemble classifiers perform end-to-end detection.
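The three-stage pipeline can be sketched as follows. This is a minimal, hypothetical illustration of the control flow only: the Whisper and LLM2Vec calls are replaced by stubs, and the toy threshold classifiers stand in for whatever trained ensemble the paper actually uses; none of these names come from the authors' code.

```python
# Hypothetical sketch of the ASR -> embedding -> ensemble pipeline.
# `transcribe`, `embed`, and the threshold classifiers are illustrative
# stubs, not the paper's actual implementation.
from typing import Callable, List

def transcribe(audio_path: str) -> str:
    """Stub for Whisper large-v2: audio file -> lyric transcript."""
    return "la la la synthetic chorus"  # placeholder transcript

def embed(lyrics: str) -> List[float]:
    """Stub for LLM2Vec: transcript -> fixed-size semantic embedding."""
    # Toy featurization: normalized mean character code and capped length.
    codes = [ord(c) for c in lyrics]
    return [sum(codes) / (len(codes) * 128.0), min(len(lyrics) / 100.0, 1.0)]

def make_threshold_clf(dim: int, thr: float) -> Callable[[List[float]], int]:
    """Toy stand-in for one trained classifier: returns 1 = AI-generated."""
    return lambda e: int(e[dim] > thr)

def detect(audio_path: str, classifiers) -> int:
    """Transcribe, embed, then majority-vote the classifier ensemble."""
    emb = embed(transcribe(audio_path))
    votes = sum(clf(emb) for clf in classifiers)
    return int(votes > len(classifiers) / 2)

ensemble = [make_threshold_clf(0, 0.5),
            make_threshold_clf(1, 0.2),
            make_threshold_clf(0, 0.9)]
label = detect("song.mp3", ensemble)  # 0 = human, 1 = AI-generated
```

In the real system, each stage would be swapped for the actual model (Whisper large-v2 for `transcribe`, LLM2Vec for `embed`, and trained classifiers in `ensemble`); the point is that detection needs only the raw audio as input.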
Contribution/Results: This work is the first to leverage a general-purpose ASR model as a modality-bridging hub, enabling reliable lyric extraction without ground-truth lyrics and effectively linking the audio and text modalities. The framework generalizes markedly better to unseen generative models and common audio perturbations (e.g., compression, reverberation). Experiments on diverse multi-genre, multilingual datasets demonstrate consistent superiority over state-of-the-art audio-only detectors, with F1-score gains of 8.2–14.6%, yielding a robust solution for real-world, audio-only input scenarios.
📝 Abstract
The recent rise in the capabilities of AI-based music generation tools has created an upheaval in the music industry, necessitating accurate methods to detect such AI-generated content. This can be done with audio-based detectors; however, these have been shown to struggle to generalize to unseen generators or to perturbed audio. Furthermore, recent work detected AI-generated music using accurate, cleanly formatted lyrics sourced from a lyrics-provider database. In practice, however, such perfect lyrics are not available (only the audio is), leaving a substantial applicability gap in real-life use cases. In this work, we propose closing this gap by transcribing songs with general automatic speech recognition (ASR) models and feeding the transcriptions to several lyrics-based detectors. Results on diverse, multi-genre, multilingual lyrics show generally strong detection performance across languages and genres, particularly for our best-performing model using Whisper large-v2 transcriptions and LLM2Vec embeddings. In addition, we show that our method is more robust than state-of-the-art audio-based ones when the audio is perturbed in different ways and when evaluated on different music generators. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.