🤖 AI Summary
Audio-based AI-generated song detection suffers from poor generalizability, while lyrics-dependent methods lack practicality due to reliance on clean, aligned textual annotations.
Method: We propose a robust, ASR-driven cross-lingual detection framework: Whisper large-v2 transcribes multilingual audio into lyrics; LLM2Vec encodes the transcriptions into semantic embeddings; and ensemble classifiers perform end-to-end detection.
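The three-stage pipeline can be sketched as follows. This is a minimal, hypothetical illustration of the control flow only: the Whisper and LLM2Vec calls are replaced by stubs, and the toy threshold classifiers stand in for whatever trained ensemble the paper actually uses; none of these names come from the authors' code.

```python
# Hypothetical sketch of the ASR -> embedding -> ensemble pipeline.
# `transcribe`, `embed`, and the threshold classifiers are illustrative
# stubs, not the paper's actual implementation.
from typing import Callable, List

def transcribe(audio_path: str) -> str:
    """Stub for Whisper large-v2: audio file -> lyric transcript."""
    return "la la la synthetic chorus"  # placeholder transcript

def embed(lyrics: str) -> List[float]:
    """Stub for LLM2Vec: transcript -> fixed-size semantic embedding."""
    # Toy featurization: normalized mean character code and capped length.
    codes = [ord(c) for c in lyrics]
    return [sum(codes) / (len(codes) * 128.0), min(len(lyrics) / 100.0, 1.0)]

def make_threshold_clf(dim: int, thr: float) -> Callable[[List[float]], int]:
    """Toy stand-in for one trained classifier: returns 1 = AI-generated."""
    return lambda e: int(e[dim] > thr)

def detect(audio_path: str, classifiers) -> int:
    """Transcribe, embed, then majority-vote the classifier ensemble."""
    emb = embed(transcribe(audio_path))
    votes = sum(clf(emb) for clf in classifiers)
    return int(votes > len(classifiers) / 2)

ensemble = [make_threshold_clf(0, 0.5),
            make_threshold_clf(1, 0.2),
            make_threshold_clf(0, 0.9)]
label = detect("song.mp3", ensemble)  # 0 = human, 1 = AI-generated
```

In the real system, each stage would be swapped for the actual model (Whisper large-v2 for `transcribe`, LLM2Vec for `embed`, and trained classifiers in `ensemble`); the point is that detection needs only the raw audio as input.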
Contribution/Results: This work is the first to leverage a general-purpose ASR model as a modality-bridging hub, enabling reliable lyric extraction without ground-truth lyrics and effectively linking the audio and text modalities. The framework generalizes markedly better to unseen generative models and common audio perturbations (e.g., compression, reverberation). Experiments on diverse multi-genre, multilingual datasets demonstrate consistent superiority over state-of-the-art audio-only detectors, with F1-score gains of 8.2–14.6%, yielding a robust solution for real-world, audio-only input scenarios.
📝 Abstract
The recent rise in the capabilities of AI-based music generation tools has created an upheaval in the music industry, necessitating accurate methods to detect such AI-generated content. This can be done with audio-based detectors; however, these have been shown to struggle to generalize to unseen generators or to perturbed audio. Furthermore, recent work detected AI-generated music using accurate, cleanly formatted lyrics sourced from a lyrics-provider database. In practice, however, such perfect lyrics are not available (only the audio is), leaving a substantial applicability gap in real-life use cases. In this work, we propose closing this gap by transcribing songs with general automatic speech recognition (ASR) models and feeding the transcriptions to several lyrics-based detectors. Results on diverse, multi-genre, multilingual lyrics show generally strong detection performance across languages and genres, particularly for our best-performing model using Whisper large-v2 transcriptions and LLM2Vec embeddings. In addition, we show that our method is more robust than state-of-the-art audio-based ones when the audio is perturbed in different ways and when evaluated on different music generators. Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.