🤖 AI Summary
Music generation currently lacks a universal, reproducible metric for quantifying audio prompt adherence, which hinders model development and cross-method evaluation. To address this, we propose a lightweight, modular evaluation framework that combines pretrained audio embeddings (OpenL3 and PANNs), a linear projection, Fréchet Audio Distance (FAD), and a mean-covariance fusion strategy to quantify adherence across three dimensions: timbral style, tonality, and rhythm. We present the first systematic validation of FAD for cross-dataset prompt adherence assessment. Empirical results demonstrate that our metric exhibits stable sensitivity to ±3 semitone pitch shifts and ±200 ms temporal offsets, while effectively discriminating between distinct levels of adherence. The framework is computationally efficient, interpretable, and compatible with diverse generative architectures. All code is publicly released to ensure reproducibility and facilitate community adoption.
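At its core, the measure fits Gaussians (mean and covariance) to the reference and candidate embedding distributions and compares them with the Fréchet distance. The sketch below is a minimal NumPy/SciPy illustration of that step, not the authors' released code: the embedding model (e.g. OpenL3 or PANNs) and the linear projection are assumed to run upstream, and `ref_emb`/`cand_emb` are hypothetical arrays of shape `(n_clips, emb_dim)`.

```python
# Minimal sketch of the Fréchet Audio Distance (FAD) between two embedding sets,
# using the standard mean-covariance formulation. Not the authors' released code.
import numpy as np
from scipy.linalg import sqrtm


def frechet_audio_distance(ref_emb: np.ndarray, cand_emb: np.ndarray) -> float:
    """FAD between reference and candidate embedding distributions."""
    mu_r, mu_c = ref_emb.mean(axis=0), cand_emb.mean(axis=0)
    sigma_r = np.cov(ref_emb, rowvar=False)
    sigma_c = np.cov(cand_emb, rowvar=False)

    # Matrix square root of the covariance product; discard the tiny imaginary
    # parts that can arise from numerical error.
    covmean = sqrtm(sigma_r @ sigma_c)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_c
    return float(diff @ diff + np.trace(sigma_r + sigma_c - 2.0 * covmean))
```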
📝 Abstract
An increasing number of generative music models can be conditioned on an audio prompt that serves as musical context for which the model is to create an accompaniment (often further specified using a text prompt). Evaluation of how well model outputs adhere to the audio prompt is often done in a model- or problem-specific manner, presumably because no generic evaluation method for audio prompt adherence has emerged. Such a method could be useful both in the development and training of new models and in making performance comparable across models. In this paper we investigate whether commonly used distribution-based distances like Fréchet Audio Distance (FAD) can be used to measure audio prompt adherence. We propose a simple procedure based on a small number of constituents (an embedding model, a projection, an embedding distance, and a data fusion method), which we systematically assess using a baseline validation. In a follow-up experiment we test the sensitivity of the proposed audio adherence measure to pitch and time shift perturbations. The results show that the proposed measure is sensitive to such perturbations, even when the reference and candidate distributions are from different music collections. Although more experimentation is needed to answer unaddressed questions like the robustness of the measure to acoustic artifacts that do not affect the audio prompt adherence, the current results suggest that distribution-based embedding distances provide a viable way of measuring audio prompt adherence. A Python/PyTorch implementation of the proposed measure is publicly available as a GitHub repository.
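As a rough illustration of the perturbation experiment, the sketch below degrades candidate audio with pitch shifts and temporal offsets before re-scoring it against the prompts; a measure that tracks prompt adherence should worsen as the perturbation grows. Here `adherence_score` is a hypothetical wrapper around the proposed measure (not part of the released package), and the audio helpers rely on `librosa`.

```python
# Illustrative sensitivity check, assuming a hypothetical adherence_score(prompts, outputs)
# wrapper around the proposed measure. Not the authors' released code.
import numpy as np
import librosa


def pitch_shift(y: np.ndarray, sr: int, semitones: float) -> np.ndarray:
    """Shift the candidate audio by a number of semitones (e.g. -3 .. +3)."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=semitones)


def time_shift(y: np.ndarray, sr: int, offset_ms: float) -> np.ndarray:
    """Delay (positive) or advance (negative) the audio by offset_ms milliseconds."""
    n = int(round(sr * offset_ms / 1000.0))
    if n >= 0:
        # Pad at the start, trim back to the original length.
        return np.concatenate([np.zeros(n, dtype=y.dtype), y])[: len(y)]
    # Drop the first |n| samples and zero-pad at the end.
    return np.concatenate([y[-n:], np.zeros(-n, dtype=y.dtype)])


# Usage idea: perturb the outputs relative to their prompts and check that the
# adherence measure degrades monotonically with the size of the perturbation.
# for semitones in (-3, -1, 0, 1, 3):
#     perturbed = [pitch_shift(y, sr, semitones) for y in outputs]
#     print(semitones, adherence_score(prompts, perturbed))
```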