🤖 AI Summary
Existing text-to-music (TTM) evaluation metrics—such as Fréchet Audio Distance (FAD)—exhibit severe deficiencies in aligning with human preferences and capturing musically meaningful semantics.
Method: We systematically analyze these limitations, introduce MusicPrefs—the first open-source TTM human preference dataset—and propose MAUVE Audio Divergence (MAD), a novel metric built upon self-supervised audio representations (wav2vec 2.0/HuBERT) and the MAUVE distributional comparison framework. MAD is rigorously validated via synthetic meta-evaluation and large-scale human annotation.
Contribution/Results: Experiments demonstrate that MAD achieves an average rank correlation of 0.84 with music-specific attributes (vs. 0.49 for FAD) and a correlation of 0.62 with human preferences (vs. 0.14 for FAD), substantially outperforming prior metrics. This work establishes the first human-preference benchmark for TTM and delivers a reproducible, perceptually consistent evaluation paradigm.
📝 Abstract
Despite significant recent advances in generative acoustic text-to-music (TTM) modeling, robust evaluation of these models lags behind, relying in particular on the popular Fréchet Audio Distance (FAD). In this work, we rigorously study the design space of reference-based divergence metrics for evaluating TTM models through (1) designing four synthetic meta-evaluations to measure sensitivity to particular musical desiderata, and (2) collecting and evaluating on MusicPrefs, the first open-source dataset of human preferences for TTM systems. We find not only that the standard FAD setup is inconsistent on both synthetic and human preference data, but that nearly all existing metrics fail to effectively capture these desiderata and are only weakly correlated with human perception. We propose a new metric, the MAUVE Audio Divergence (MAD), computed on representations from a self-supervised audio embedding model. We find that this metric effectively captures diverse musical desiderata (average rank correlation 0.84 for MAD vs. 0.49 for FAD) and also correlates more strongly with MusicPrefs (0.62 vs. 0.14).
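To make the distributional comparison concrete: MAUVE scores two embedding distributions (here, reference audio vs. generated audio, after quantizing self-supervised embeddings into a shared histogram) by tracing a divergence frontier between them and integrating the area under it. Below is a minimal NumPy sketch of the histogram-level computation only; the function name, grid size, and scaling constant `c=5` are illustrative defaults, not the paper's actual implementation, which operates on quantized wav2vec 2.0/HuBERT embeddings.

```python
import numpy as np

def mauve_from_histograms(p, q, c=5.0, grid=101):
    """MAUVE-style score between two discrete distributions p and q
    (e.g. cluster histograms of reference vs. generated embeddings)."""
    pts = [(0.0, 1.0), (1.0, 0.0)]  # frontier endpoints
    for lam in np.linspace(1e-6, 1 - 1e-6, grid):
        r = lam * p + (1 - lam) * q  # mixture distribution R
        mp, mq = p > 0, q > 0        # avoid log(0) off-support
        kl_p = float(np.sum(p[mp] * np.log(p[mp] / r[mp])))  # KL(P || R)
        kl_q = float(np.sum(q[mq] * np.log(q[mq] / r[mq])))  # KL(Q || R)
        pts.append((np.exp(-c * kl_q), np.exp(-c * kl_p)))
    pts = np.array(pts)
    # area under the frontier: sort by x ascending, break ties by y descending
    order = np.lexsort((-pts[:, 1], pts[:, 0]))
    xs, ys = pts[order, 0], pts[order, 1]
    return float(np.sum((xs[1:] - xs[:-1]) * (ys[1:] + ys[:-1]) / 2.0))

# Identical distributions score ~1; disjoint supports score near 0.
p = np.array([0.5, 0.5, 0.0, 0.0])
q = np.array([0.0, 0.0, 0.5, 0.5])
print(mauve_from_histograms(p, p))  # 1.0
print(mauve_from_histograms(p, q))  # near 0
```

The frontier view is what distinguishes this family of metrics from FAD: rather than fitting a single Gaussian to each embedding set and comparing means and covariances, it compares the full (quantized) distributions, which is why it can remain sensitive to mode-level differences that a Gaussian fit smooths away.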