🤖 AI Summary
Music generation currently lacks a universal, reproducible metric for quantifying audio prompt adherence, which hinders model development and cross-method evaluation. To address this, we propose a lightweight, modular evaluation framework that combines pretrained audio embeddings (OpenL3 and PANNs), a linear projection, Fréchet Audio Distance (FAD), and a mean-covariance fusion strategy to quantify adherence across three dimensions: timbral style, tonality, and rhythm. We present the first systematic validation of FAD for cross-dataset prompt adherence assessment. Empirical results demonstrate that our metric exhibits stable sensitivity to ±3 semitone pitch shifts and ±200 ms temporal offsets, while effectively discriminating between distinct levels of adherence. The framework is computationally efficient, interpretable, and compatible with diverse generative architectures. All code is publicly released to ensure reproducibility and facilitate community adoption.
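At its core, the measure fits Gaussians (mean and covariance) to the reference and candidate embedding distributions and compares them with the Fréchet distance. The sketch below is a minimal NumPy/SciPy illustration of that step, not the authors' released code: the embedding model (e.g. OpenL3 or PANNs) and the linear projection are assumed to run upstream, and `ref_emb`/`cand_emb` are hypothetical arrays of shape `(n_clips, emb_dim)`.

```python
# Minimal sketch of the Fréchet Audio Distance (FAD) between two embedding sets,
# using the standard mean-covariance formulation. Not the authors' released code.
import numpy as np
from scipy.linalg import sqrtm


def frechet_audio_distance(ref_emb: np.ndarray, cand_emb: np.ndarray) -> float:
    """FAD between reference and candidate embedding distributions."""
    mu_r, mu_c = ref_emb.mean(axis=0), cand_emb.mean(axis=0)
    sigma_r = np.cov(ref_emb, rowvar=False)
    sigma_c = np.cov(cand_emb, rowvar=False)

    # Matrix square root of the covariance product; discard the tiny imaginary
    # parts that can arise from numerical error.
    covmean = sqrtm(sigma_r @ sigma_c)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    diff = mu_r - mu_c
    return float(diff @ diff + np.trace(sigma_r + sigma_c - 2.0 * covmean))
```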
📝 Abstract
An increasing number of generative music models can be conditioned on an audio prompt that serves as musical context for which the model is to create an accompaniment (often further specified using a text prompt). Evaluation of how well model outputs adhere to the audio prompt is often done in a model- or problem-specific manner, presumably because no generic evaluation method for audio prompt adherence has emerged. Such a method could be useful both in the development and training of new models and in making performance comparable across models. In this paper we investigate whether commonly used distribution-based distances like Fréchet Audio Distance (FAD) can be used to measure audio prompt adherence. We propose a simple procedure based on a small number of constituents (an embedding model, a projection, an embedding distance, and a data fusion method), which we systematically assess using a baseline validation. In a follow-up experiment we test the sensitivity of the proposed audio adherence measure to pitch and time shift perturbations. The results show that the proposed measure is sensitive to such perturbations, even when the reference and candidate distributions are from different music collections. Although more experimentation is needed to answer unaddressed questions like the robustness of the measure to acoustic artifacts that do not affect the audio prompt adherence, the current results suggest that distribution-based embedding distances provide a viable way of measuring audio prompt adherence. A Python/PyTorch implementation of the proposed measure is publicly available as a GitHub repository.
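As a rough illustration of the perturbation experiment, the sketch below degrades candidate audio with pitch shifts and temporal offsets before re-scoring it against the prompts; a measure that tracks prompt adherence should worsen as the perturbation grows. Here `adherence_score` is a hypothetical wrapper around the proposed measure (not part of the released package), and the audio helpers rely on `librosa`.

```python
# Illustrative sensitivity check, assuming a hypothetical adherence_score(prompts, outputs)
# wrapper around the proposed measure. Not the authors' released code.
import numpy as np
import librosa


def pitch_shift(y: np.ndarray, sr: int, semitones: float) -> np.ndarray:
    """Shift the candidate audio by a number of semitones (e.g. -3 .. +3)."""
    return librosa.effects.pitch_shift(y, sr=sr, n_steps=semitones)


def time_shift(y: np.ndarray, sr: int, offset_ms: float) -> np.ndarray:
    """Delay (positive) or advance (negative) the audio by offset_ms milliseconds."""
    n = int(round(sr * offset_ms / 1000.0))
    if n >= 0:
        # Pad at the start, trim back to the original length.
        return np.concatenate([np.zeros(n, dtype=y.dtype), y])[: len(y)]
    # Drop the first |n| samples and zero-pad at the end.
    return np.concatenate([y[-n:], np.zeros(-n, dtype=y.dtype)])


# Usage idea: perturb the outputs relative to their prompts and check that the
# adherence measure degrades monotonically with the size of the perturbation.
# for semitones in (-3, -1, 0, 1, 3):
#     perturbed = [pitch_shift(y, sr, semitones) for y in outputs]
#     print(semitones, adherence_score(prompts, perturbed))
```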