Melody or Machine: Detecting Synthetic Music with Dual-Stream Contrastive Learning

📅 2025-11-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

The rapid advancement of end-to-end AI music generation poses severe threats to artistic authenticity and copyright protection, while existing detection methods generalize poorly to out-of-distribution (OOD) synthetic content. To address this, the authors propose CLAM, a dual-stream contrastive learning architecture that exploits subtle acoustic inconsistencies between vocal and instrumental representations to identify AI-generated music. CLAM employs two parallel pre-trained audio encoders, MERT and Wav2Vec 2.0, to extract complementary instrument- and speech-oriented acoustic features. It fuses them with a learnable cross-aggregation module and jointly optimizes a binary cross-entropy loss with a contrastive triplet loss. Evaluated on MoM, a newly constructed large-scale, diverse benchmark of over 130K tracks with a curated OOD test set, CLAM achieves an F1 score of 0.925, which the authors report as a new state of the art.

📝 Abstract

The rapid evolution of end-to-end AI music generation poses an escalating threat to artistic authenticity and copyright, demanding detection methods that can keep pace. While foundational, existing models like SpecTTTra falter when faced with the diverse and rapidly advancing ecosystem of new generators, exhibiting significant performance drops on out-of-distribution (OOD) content. This generalization failure highlights a critical gap: the need for more challenging benchmarks and more robust detection architectures. To address this, we first introduce Melody or Machine (MoM), a new large-scale benchmark of over 130,000 songs (6,665 hours). MoM is the most diverse dataset to date, built with a mix of open- and closed-source models and a curated OOD test set designed specifically to foster the development of truly generalizable detectors. Alongside this benchmark, we introduce CLAM, a novel dual-stream detection architecture. We hypothesize that subtle, machine-induced inconsistencies between vocal and instrumental elements, often imperceptible in a mixed signal, offer a powerful tell-tale sign of synthesis. CLAM is designed to test this hypothesis by employing two distinct pre-trained audio encoders (MERT and Wav2Vec 2.0) to create parallel representations of the audio. These representations are fused by a learnable cross-aggregation module that models their inter-dependencies. The model is trained with a dual-loss objective: a standard binary cross-entropy loss for classification, complemented by a contrastive triplet loss which trains the model to distinguish between coherent and artificially mismatched stream pairings, enhancing its sensitivity to synthetic artifacts without presuming a simple feature alignment. CLAM establishes a new state-of-the-art in synthetic music forensics. It achieves an F1 score of 0.925 on our challenging MoM benchmark.
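The dual-loss objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the weighting factor `lam`, the margin, the Euclidean distance, and the way triplets are formed (anchor and positive taken from coherent stream pairings, negative from an artificially mismatched pairing) are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def bce_loss(prob, label, eps=1e-7):
    """Binary cross-entropy for one prediction (label 1 = synthetic, by assumption)."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the positive closer than the negative by `margin`.

    Assumed roles: anchor/positive come from coherent stream pairings,
    negative comes from an artificially mismatched pairing.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def joint_loss(prob, label, anchor, positive, negative, lam=0.5, margin=1.0):
    """Classification loss plus a weighted contrastive term (`lam` is hypothetical)."""
    return bce_loss(prob, label) + lam * triplet_loss(anchor, positive, negative, margin)


# Toy example: the negative is already far from the anchor, so the triplet term is zero
# and only the classification term contributes.
loss = joint_loss(
    prob=0.9, label=1,
    anchor=[0.0, 0.0], positive=[0.1, 0.0], negative=[3.0, 4.0],
)
print(round(loss, 4))
```

In a real training loop both terms would be averaged over a batch and differentiated by an autodiff framework; the scalar version here only shows how the two terms combine.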

Problem

Research questions and friction points this paper is trying to address.

Detects synthetic music to protect artistic authenticity and copyright.

Addresses generalization failure of existing models on diverse AI-generated content.

Introduces a new benchmark and robust architecture for improved detection.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the MoM benchmark, a diverse synthetic music dataset.

Proposes CLAM, a dual-stream architecture with two separate audio encoders.

Uses a contrastive triplet loss to detect synthetic vocal-instrument inconsistencies.
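To make the dual-stream idea more concrete, here is one possible reading of a "cross-aggregation" step, written as bidirectional cross-attention between two encoder outputs. The source only says the module is learnable and models inter-dependencies between the streams, so the attention formulation, the mean pooling, the random projection matrices, and every dimension below are hypothetical stand-ins, not the paper's design.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(query_seq, context_seq, w_q, w_k, w_v):
    """Let each frame of `query_seq` attend over all frames of `context_seq`."""
    q = query_seq @ w_q      # (T_q, d)
    k = context_seq @ w_k    # (T_c, d)
    v = context_seq @ w_v    # (T_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    return softmax(scores, axis=-1) @ v      # (T_q, d)


def cross_aggregate(stream_a, stream_b, params):
    """Fuse two streams: A attends to B, B attends to A, then pool and concatenate."""
    a_from_b = cross_attend(stream_a, stream_b, *params["a"])
    b_from_a = cross_attend(stream_b, stream_a, *params["b"])
    return np.concatenate([a_from_b.mean(axis=0), b_from_a.mean(axis=0)])


# Hypothetical sizes: stream A has 6-dim frames, stream B has 4-dim frames,
# and both are projected into a shared 8-dim attention space.
rng = np.random.default_rng(0)
d_a, d_b, d = 6, 4, 8
params = {
    "a": (rng.normal(size=(d_a, d)), rng.normal(size=(d_b, d)), rng.normal(size=(d_b, d))),
    "b": (rng.normal(size=(d_b, d)), rng.normal(size=(d_a, d)), rng.normal(size=(d_a, d))),
}
stream_a = rng.normal(size=(10, d_a))  # stand-in for one encoder's frame features
stream_b = rng.normal(size=(12, d_b))  # stand-in for the other encoder's frame features

fused = cross_aggregate(stream_a, stream_b, params)
print(fused.shape)
```

The fused vector would then feed a classification head; in a trained model the projection matrices would be learned parameters rather than random draws.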

🔎 Similar Papers

COCOLA: Coherence-Oriented Contrastive Learning of Musical Audio Representations

2024-04-25 · arXiv.org · Citations: 2
