🤖 AI Summary
Fréchet Audio Distance (FAD) suffers from reliance on Gaussian assumptions, instability under small sample sizes, and high computational cost—limiting its reliability and practicality for audio generation evaluation. To address these limitations, we propose Kernel Audio Distance (KAD), the first audio distance metric grounded in Maximum Mean Discrepancy (MMD) that is distribution-free, unbiased, and computationally efficient. KAD leverages pretrained audio embeddings (e.g., OpenL3, PANNs) and a learnable kernel function, enabling GPU-accelerated parallel computation. Experiments demonstrate that KAD achieves rapid convergence—stabilizing with only one-fifth the sample size required by FAD—while delivering a 3.2× speedup in computation. Moreover, KAD exhibits significantly improved correlation with human perceptual judgments, increasing Pearson correlation by 27% over FAD. The implementation is publicly released as the open-source `kadtk` toolkit.
📝 Abstract
Although being widely adopted for evaluating generated audio signals, the Fr'echet Audio Distance (FAD) suffers from significant limitations, including reliance on Gaussian assumptions, sensitivity to sample size, and high computational complexity. As an alternative, we introduce the Kernel Audio Distance (KAD), a novel, distribution-free, unbiased, and computationally efficient metric based on Maximum Mean Discrepancy (MMD). Through analysis and empirical validation, we demonstrate KAD's advantages: (1) faster convergence with smaller sample sizes, enabling reliable evaluation with limited data; (2) lower computational cost, with scalable GPU acceleration; and (3) stronger alignment with human perceptual judgments. By leveraging advanced embeddings and characteristic kernels, KAD captures nuanced differences between real and generated audio. Open-sourced in the kadtk toolkit, KAD provides an efficient, reliable, and perceptually aligned benchmark for evaluating generative audio models.