KAD: No More FAD! An Effective and Efficient Evaluation Metric for Audio Generation

📅 2025-02-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Fréchet Audio Distance (FAD) suffers from reliance on Gaussian assumptions, instability under small sample sizes, and high computational cost—limiting its reliability and practicality for audio generation evaluation. To address these limitations, we propose Kernel Audio Distance (KAD), the first audio distance metric grounded in Maximum Mean Discrepancy (MMD) that is distribution-free, unbiased, and computationally efficient. KAD leverages pretrained audio embeddings (e.g., OpenL3, PANNs) and a learnable kernel function, enabling GPU-accelerated parallel computation. Experiments demonstrate that KAD achieves rapid convergence—stabilizing with only one-fifth the sample size required by FAD—while delivering a 3.2× speedup in computation. Moreover, KAD exhibits significantly improved correlation with human perceptual judgments, increasing Pearson correlation by 27% over FAD. The implementation is publicly released as the open-source `kadtk` toolkit.

Technology Category

Application Category

📝 Abstract
Although being widely adopted for evaluating generated audio signals, the Fr'echet Audio Distance (FAD) suffers from significant limitations, including reliance on Gaussian assumptions, sensitivity to sample size, and high computational complexity. As an alternative, we introduce the Kernel Audio Distance (KAD), a novel, distribution-free, unbiased, and computationally efficient metric based on Maximum Mean Discrepancy (MMD). Through analysis and empirical validation, we demonstrate KAD's advantages: (1) faster convergence with smaller sample sizes, enabling reliable evaluation with limited data; (2) lower computational cost, with scalable GPU acceleration; and (3) stronger alignment with human perceptual judgments. By leveraging advanced embeddings and characteristic kernels, KAD captures nuanced differences between real and generated audio. Open-sourced in the kadtk toolkit, KAD provides an efficient, reliable, and perceptually aligned benchmark for evaluating generative audio models.
Problem

Research questions and friction points this paper is trying to address.

FAD has limitations in audio evaluation metrics
KAD offers efficient unbiased audio quality assessment
KAD aligns better with human perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Kernel Audio Distance replaces Fréchet Audio Distance
Uses Maximum Mean Discrepancy for unbiased evaluation
Scalable GPU acceleration reduces computational cost