🤖 AI Summary
Standard cross-entropy loss struggles to evaluate the generation quality of music large language models and may even decrease when inputs are corrupted, failing as a reliable quality metric. This work systematically investigates the model's loss response to controlled perturbations of the musical context, revealing through targeted noise injection that the model is significantly more sensitive to local textural distortions than to global semantic alterations. Building on this insight, the study proposes using the shape of the loss curve, rather than its absolute magnitude, as an unsupervised, model-intrinsic indicator of generation quality. It further demonstrates that the loss peak induced by brief noise injections serves as a robust proxy for musical integrity, establishing a novel paradigm for evaluating music generation quality without human annotations.
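The "peak" statistic mentioned above can be made concrete as a small post-processing step over the per-frame loss curve. The sketch below is illustrative rather than the paper's exact formulation: it assumes per-frame cross-entropy values have already been extracted, and the baseline window and clipping rule are our own choices.

```python
import numpy as np

def peak_area(loss_profile, injection_start, injection_end, baseline_window=50):
    """Estimate the area of the loss spike above the pre-injection baseline.

    loss_profile    : 1-D array of per-frame cross-entropy values.
    injection_*     : frame indices bounding the injected noise segment.
    baseline_window : number of clean frames before the injection used to
                      estimate the baseline loss level.
    """
    baseline = loss_profile[max(0, injection_start - baseline_window):injection_start].mean()
    spike = loss_profile[injection_start:injection_end] - baseline
    # Discrete approximation of the area: only count loss above the baseline.
    return np.clip(spike, 0.0, None).sum()

# Example: a synthetic loss curve with a sharp peak over frames 100-110.
rng = np.random.default_rng(0)
profile = rng.normal(2.0, 0.05, 300)
profile[100:110] += np.linspace(3.0, 0.5, 10)
print(peak_area(profile, 100, 110))
```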
📝 Abstract
The rise of music large language models (LLMs) demands robust methods of evaluating output quality, especially in distinguishing high-quality compositions from "garbage music". Curiously, we observe that the standard cross-entropy loss -- a core training metric -- often decreases when models encounter systematically corrupted music, undermining its validity as a standalone quality indicator. To investigate this paradox, we introduce noise injection experiments, in which controlled noise signals of varying lengths are injected into musical contexts. We hypothesize that a model's loss reacting positively to these perturbations, specifically a sharp increase ("Peak" area) for short injections, can serve as a proxy for its ability to discern musical integrity. Experiments with MusicGen models in the audio waveform domain confirm that music LLMs respond more strongly to local, texture-level disruptions than to global semantic corruption. Beyond exposing this bias, our results highlight a new principle: the shape of the loss curve -- rather than its absolute value -- encodes critical information about the quality of the generated content (i.e., model behavior). We envision this profile-based evaluation as a label-free, model-intrinsic framework for assessing musical quality -- opening the door to more principled training objectives and sharper benchmarks.
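For illustration, a noise injection of the kind the abstract describes can be sketched as follows. This is a minimal, assumption-laden example: the SNR scaling, segment placement, and the hypothetical `per_frame_loss` helper are our own stand-ins, not the authors' exact protocol or the MusicGen API.

```python
import numpy as np

def inject_noise(waveform, sample_rate, start_s, duration_s, snr_db=0.0, seed=0):
    """Overlay Gaussian noise on a short segment of an audio waveform.

    waveform            : 1-D float array of audio samples in [-1, 1].
    start_s, duration_s : where the perturbation begins and how long it lasts.
    snr_db              : signal-to-noise ratio within the corrupted segment
                          (0 dB means noise power equals segment power).
    """
    rng = np.random.default_rng(seed)
    start = int(start_s * sample_rate)
    end = min(len(waveform), start + int(duration_s * sample_rate))
    segment = waveform[start:end]
    noise = rng.standard_normal(end - start)
    # Scale the noise to the requested SNR relative to the segment's power.
    sig_power = np.mean(segment ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    noise *= np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    corrupted = waveform.copy()
    corrupted[start:end] = np.clip(segment + noise, -1.0, 1.0)
    return corrupted

# Sweep injection lengths and record a loss profile for each corrupted clip.
# `per_frame_loss(model, audio)` is a hypothetical helper that would return the
# model's per-frame cross-entropy over the tokenized audio sequence.
# durations = [0.1, 0.5, 1.0, 2.0]  # seconds of injected noise
# profiles = {d: per_frame_loss(model, inject_noise(clip, sr, 5.0, d))
#             for d in durations}
```

Comparing the resulting loss profiles across injection lengths, rather than comparing their mean values, is what the profile-based evaluation described in the abstract amounts to.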