Benchmarking Music Generation Models and Metrics via Human Preference Studies

📅 2025-06-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the misalignment between human subjective preferences and objective evaluation metrics in music generation, focusing on text-audio alignment and musical quality. We construct a large-scale benchmark of 6,000 generated tracks from 12 state-of-the-art models and conduct 15,000 pairwise auditory comparison trials with 2,500 human participants, to our knowledge the first human preference study of this scale in the domain. Through statistical correlation analysis, we systematically quantify the consistency between automated metrics (e.g., CLAP, FAD) and human judgments, revealing substantial discrepancies between existing metrics and true perceptual preferences. Our work delivers the most comprehensive ranking of both models and evaluation metrics to date, and publicly releases a high-quality human preference dataset. This resource establishes a robust, human-centered benchmark to drive shifts in evaluation methodology and model optimization for text-to-music generation.
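To make the correlation analysis concrete, here is a minimal sketch of how metric-vs-human agreement can be quantified. This is not the paper's pipeline: all numbers are synthetic, and `metric_scores` merely stands in for a per-model aggregate of an automated metric such as CLAP score; rank correlation (Spearman's rho) is one standard choice for this kind of comparison.

```python
# Illustrative sketch (synthetic data, not from the paper): correlating an
# automated metric's per-model scores with human pairwise win rates.
from scipy.stats import spearmanr

# Hypothetical per-model aggregates for six models:
# an automated metric score, and the fraction of pairwise comparisons
# each model's tracks won against the others.
metric_scores = [0.31, 0.28, 0.45, 0.52, 0.38, 0.41]
human_win_rates = [0.40, 0.35, 0.55, 0.70, 0.52, 0.48]

# Spearman's rho measures how well the metric's ranking of models
# agrees with the ranking induced by human preferences.
rho, p_value = spearmanr(metric_scores, human_win_rates)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```

A rho near 1 would mean the metric orders models almost exactly as humans do; values well below 1, as the paper reports for several existing metrics, indicate a discrepancy between the metric and perceptual preference.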

📝 Abstract
Recent advancements have brought generated music closer to human-created compositions, yet evaluating these models remains challenging. While human preference is the gold standard for assessing quality, translating these subjective judgments into objective metrics, particularly for text-audio alignment and music quality, has proven difficult. In this work, we generate 6k songs using 12 state-of-the-art models and conduct a survey of 15k pairwise audio comparisons with 2.5k human participants to evaluate the correlation between human preferences and widely used metrics. To the best of our knowledge, this work is the first to rank current state-of-the-art music generation models and metrics based on human preference. To further the field of subjective metric evaluation, we provide open access to our dataset of generated music and human evaluations.
Problem

Research questions and friction points this paper is trying to address.

Evaluating music generation models via human preferences
Assessing text-audio alignment and music quality metrics
Ranking state-of-the-art models using human survey data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates 6k songs using 12 state-of-the-art models
Conducts 15k human pairwise audio comparisons
Ranks models and metrics via human preference
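The steps above, turning pairwise comparisons into a model ranking, can be sketched with a simple Bradley-Terry fit. This is an illustration under assumptions: the paper does not specify its aggregation method here, and the comparison counts below are invented for three hypothetical models.

```python
# Illustrative sketch (not the paper's stated method): ranking models from
# pairwise comparison counts with a Bradley-Terry model fit by the
# standard minorization-maximization (MM) update.
import numpy as np

def bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths; wins[i][j] = times i beat j."""
    n = wins.shape[0]
    p = np.ones(n)                         # initial strengths
    for _ in range(iters):
        total = wins + wins.T              # comparisons played per pair
        w = wins.sum(axis=1)               # total wins per model
        # MM update: p_i = W_i / sum_j n_ij / (p_i + p_j)
        denom = (total / (p[:, None] + p[None, :])).sum(axis=1)
        p = w / denom
        p /= p.sum()                       # normalize for identifiability
    return p

# Synthetic comparison counts among three hypothetical models A, B, C.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]], dtype=float)

strengths = bradley_terry(wins)
ranking = np.argsort(-strengths)           # indices of models, best first
```

With these counts, model A (index 0) wins most of its comparisons and comes out on top. The same machinery scales to the paper's setting of 12 models and 15k comparisons, where each trial increments one cell of the win matrix.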