Understanding Frechet Speech Distance for Synthetic Speech Quality Evaluation

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the reliance of synthetic speech quality evaluation on human listening tests, which are costly and hard to scale. It evaluates the Fréchet Speech Distance (FSD) and its variant, Speech Maximum Mean Discrepancy (SMMD), across various speech embeddings, including WavLM Base+, and diverse experimental conditions. The perceptual relevance of these metrics is validated against human listening evaluations, TTS intelligibility, and the word error rate (WER) of ASR models trained on synthetic speech. WavLM Base+ features yield the most stable alignment with human ratings. FSD and SMMD cannot fully replace subjective evaluation, but they can serve as complementary, cost-efficient, and reproducible measures when large-scale or direct listening assessments are infeasible.

📝 Abstract
Objective evaluation of synthetic speech quality remains a critical challenge. Human listening tests are the gold standard, but costly and impractical at scale. Fréchet Distance has emerged as a promising alternative, yet its reliability depends heavily on the choice of embeddings and experimental settings. In this work, we comprehensively evaluate Fréchet Speech Distance (FSD) and its variant Speech Maximum Mean Discrepancy (SMMD) under varied embeddings and conditions. We further incorporate human listening evaluations alongside TTS intelligibility and synthetic-trained ASR WER to validate the perceptual relevance of these metrics. Our findings show that WavLM Base+ features yield the most stable alignment with human ratings. While FSD and SMMD cannot fully replace subjective evaluation, we show that they can serve as complementary, cost-efficient, and reproducible measures, particularly useful when large-scale or direct listening assessments are infeasible. Code is available at https://github.com/kaen2891/FrechetSpeechDistance.
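
For readers unfamiliar with the underlying quantity, the following is a minimal sketch of the standard Fréchet distance between two Gaussians fitted to embedding sets, d² = ‖μ_r − μ_s‖² + Tr(Σ_r + Σ_s − 2(Σ_r Σ_s)^{1/2}). It assumes FSD follows this usual formulation; it is not taken from the authors' repository. The function name `frechet_distance` and the random arrays standing in for utterance-level embeddings (for example, pooled WavLM Base+ features) are hypothetical.

```python
import numpy as np


def frechet_distance(real_emb: np.ndarray, synth_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    Each input has shape (num_utterances, embedding_dim).
    """
    mu_r = real_emb.mean(axis=0)
    mu_s = synth_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_s = np.cov(synth_emb, rowvar=False)

    # Tr((cov_r cov_s)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_s, which are real and non-negative for
    # positive semi-definite covariances. Clip tiny negatives from round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_s)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_s) - 2.0 * tr_sqrt)


if __name__ == "__main__":
    # Hypothetical stand-ins for embeddings of real and synthetic speech.
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 16))
    synth = rng.normal(loc=0.5, size=(500, 16))
    print(frechet_distance(real, synth))
```

A lower value means the two embedding distributions are closer under the Gaussian assumption. As the abstract notes, the result depends heavily on which embedding model produces the features.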
Problem

Research questions and friction points this paper is trying to address.

synthetic speech quality
objective evaluation
Fréchet Speech Distance
human listening tests
speech embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fréchet Speech Distance
Speech Quality Evaluation
WavLM embeddings
Objective Metrics
Synthetic Speech Assessment
🔎 Similar Papers
June-Woo Kim
Gwangju Institute of Science and Technology, Republic of Korea
Dhruv Agarwal
University of Massachusetts Amherst
Machine Learning, Natural Language Processing, Search and Discovery
Federica Cerina
Amazon Science, USA