MuQ-Eval: An Open-Source Per-Sample Quality Metric for AI Music Generation Evaluation

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of evaluating individual AI-generated music clips: existing audio quality metrics correlate poorly with human judgments and lack open-source implementations. Leveraging a frozen MuQ-310M feature extractor and the MusicEval dataset, which comprises outputs from 31 text-to-music systems annotated with expert scores, the authors train a lightweight prediction head to create the first high-performance, fully open-source model for single-sample AI music quality assessment. Fine-tuning a personalized evaluator requires only 150 labeled samples, and the model supports real-time inference on a single consumer-grade GPU. It achieves system-level and clip-level Spearman rank correlation coefficients (SRCC) of 0.957 and 0.838, respectively, substantially outperforming existing open-source metrics while demonstrating sensitivity to signal distortions and robustness to variations in musical structure.
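The SRCC figures above measure how well the metric's predicted scores rank clips the same way human mean opinion scores do. A minimal sketch of that computation (the `pred` and `mos` values are hypothetical, and the rank-based formula shown ignores ties for simplicity):

```python
import numpy as np

def srcc(a, b):
    """Spearman rank correlation coefficient (no tie handling; illustration only)."""
    ra = np.argsort(np.argsort(a)).astype(float)  # ranks of a
    rb = np.argsort(np.argsort(b)).astype(float)  # ranks of b
    d = ra - rb
    n = len(a)
    return 1.0 - 6.0 * (d @ d) / (n * (n**2 - 1))

# Hypothetical predicted quality scores vs. human MOS for five clips.
pred = np.array([3.1, 4.0, 2.2, 3.8, 4.5])
mos  = np.array([3.0, 4.2, 2.5, 3.6, 4.4])
print(round(srcc(pred, mos), 3))  # ranks agree exactly here, so this prints 1.0
```

System-level SRCC is the same calculation applied to per-system mean scores rather than individual clips, which is why it is typically higher than the clip-level figure.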

📝 Abstract
Distributional metrics such as Fréchet Audio Distance cannot score individual music clips and correlate poorly with human judgments, while the only per-sample learned metric achieving high human correlation is closed-source. We introduce MuQ-Eval, an open-source per-sample quality metric for AI-generated music, built by training lightweight prediction heads on frozen MuQ-310M features using MusicEval, a dataset of generated clips from 31 text-to-music systems with expert quality ratings. Our simplest model, frozen features with attention pooling and a two-layer MLP, achieves system-level SRCC = 0.957 and utterance-level SRCC = 0.838 against human mean opinion scores. A systematic ablation over training objectives and adaptation strategies shows that no addition meaningfully improves on the frozen baseline, indicating that frozen MuQ representations already capture quality-relevant information. Encoder choice is the dominant design factor, outweighing all architectural and training decisions. LoRA-adapted models trained on as few as 150 clips already achieve usable correlation, enabling personalized quality evaluators built from individual listener annotations. A controlled degradation analysis reveals selective sensitivity to signal-level artifacts but insensitivity to musical-structural distortions. MuQ-Eval is fully open-source, outperforms existing open per-sample metrics, and runs in real time on a single consumer GPU. Code, model weights, and evaluation scripts are available at https://github.com/dgtql/MuQ-Eval.
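The "attention pooling and a two-layer MLP" head described in the abstract can be sketched in a few lines. This is a minimal numpy illustration, not the released implementation: the feature dimension, sequence length, and all weights here are random stand-ins, since the paper's frozen MuQ-310M encoder would supply the real frame-level features.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 1024  # assumed hidden size of the frozen encoder's frame features
T = 50    # assumed number of frame-level feature vectors for one clip

# Frame-level features that a frozen encoder would produce (random stand-in).
feats = rng.standard_normal((T, D))

# Attention pooling: a learned query scores each frame; softmax weights
# collapse the variable-length sequence into a single clip embedding.
query = rng.standard_normal(D)
scores = feats @ query / np.sqrt(D)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
clip_emb = weights @ feats                      # shape (D,)

# Two-layer MLP regression head mapping the embedding to a quality score.
W1, b1 = rng.standard_normal((D, 128)) * 0.02, np.zeros(128)
W2, b2 = rng.standard_normal(128) * 0.02, 0.0
hidden = np.maximum(clip_emb @ W1 + b1, 0.0)    # ReLU
pred_score = float(hidden @ W2 + b2)            # scalar predicted quality
```

Because the encoder stays frozen, only the query vector and the small MLP are trained, which is consistent with the abstract's claim that few labeled clips suffice for adaptation.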
Problem

Research questions and friction points this paper is trying to address.

AI music generation
quality evaluation
per-sample metric
human correlation
open-source
Innovation

Methods, ideas, or system contributions that make the work stand out.

per-sample evaluation
frozen feature representation
MuQ-Eval
LoRA adaptation
music generation quality metric