Understanding Fréchet Speech Distance for Synthetic Speech Quality Evaluation

📅 2026-01-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the reliance on costly, hard-to-scale human listening tests for evaluating synthetic speech quality. It evaluates the Fréchet Speech Distance (FSD) and its variant, Speech Maximum Mean Discrepancy (SMMD), across several speech embeddings, including the WavLM family, and under varied experimental conditions. The perceptual relevance of these metrics is checked against human listening evaluations, text-to-speech (TTS) intelligibility, and the word error rate (WER) of automatic speech recognition (ASR) models trained on synthetic speech. WavLM Base+ features yield the most stable alignment with human ratings. FSD and SMMD cannot fully replace subjective evaluation, but they can serve as complementary, low-cost, and reproducible measures when large-scale or direct listening tests are infeasible.
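As an illustration of what a Fréchet-style distance computes, the sketch below fits a Gaussian to each set of embeddings and evaluates ||mu_r - mu_s||^2 + tr(C_r) + tr(C_s) - 2 tr((C_r C_s)^(1/2)). This is the generic Fréchet distance between Gaussians, not the authors' implementation; the function name `frechet_distance` and the random placeholder arrays are assumptions made for this example, and in practice the inputs would be utterance-level speech embeddings (for instance, pooled WavLM features).

```python
import numpy as np


def frechet_distance(real_emb, synth_emb):
    """Fréchet distance between Gaussians fitted to two embedding sets.

    Both inputs are arrays of shape (n_samples, dim). Hypothetical helper,
    not taken from the paper's code.
    """
    real_emb = np.asarray(real_emb, dtype=np.float64)
    synth_emb = np.asarray(synth_emb, dtype=np.float64)

    # Gaussian statistics of each set.
    mu_r = real_emb.mean(axis=0)
    mu_s = synth_emb.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(real_emb, rowvar=False))
    cov_s = np.atleast_2d(np.cov(synth_emb, rowvar=False))

    # tr((C_r C_s)^(1/2)) equals the sum of square roots of the eigenvalues
    # of C_r C_s, which are real and non-negative for PSD covariances.
    # Clipping guards against tiny negative values from round-off.
    eigvals = np.linalg.eigvals(cov_r @ cov_s)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_s) - 2.0 * trace_sqrt)


# Placeholder data standing in for real and synthetic speech embeddings.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 16))
synth = rng.normal(loc=0.5, size=(500, 16))
print(frechet_distance(real, synth))
```

Lower values indicate that the two embedding distributions are closer under the Gaussian assumption. The estimate is sensitive to the number of samples relative to the embedding dimension, so both sets should be reasonably large.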

📝 Abstract

Objective evaluation of synthetic speech quality remains a critical challenge. Human listening tests are the gold standard, but costly and impractical at scale. Fréchet Distance has emerged as a promising alternative, yet its reliability depends heavily on the choice of embeddings and experimental settings. In this work, we comprehensively evaluate Fréchet Speech Distance (FSD) and its variant Speech Maximum Mean Discrepancy (SMMD) under varied embeddings and conditions. We further incorporate human listening evaluations alongside TTS intelligibility and synthetic-trained ASR WER to validate the perceptual relevance of these metrics. Our findings show that WavLM Base+ features yield the most stable alignment with human ratings. While FSD and SMMD cannot fully replace subjective evaluation, we show that they can serve as complementary, cost-efficient, and reproducible measures, particularly useful when large-scale or direct listening assessments are infeasible. Code is available at https://github.com/kaen2891/FrechetSpeechDistance.
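To make the maximum mean discrepancy idea concrete, here is a minimal sketch of a standard unbiased MMD² estimate with a Gaussian (RBF) kernel. The abstract does not specify the kernel or estimator used for SMMD, so the kernel choice, the `bandwidth` parameter, and the function names are assumptions for illustration only; the input arrays are random placeholders, not speech embeddings.

```python
import numpy as np


def rbf_kernel(x, y, bandwidth):
    """Gaussian kernel matrix between rows of x and rows of y."""
    sq_dist = (x ** 2).sum(axis=1)[:, None] + (y ** 2).sum(axis=1)[None, :] - 2.0 * x @ y.T
    # Round-off can make squared distances slightly negative.
    sq_dist = np.maximum(sq_dist, 0.0)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of squared MMD between two sample sets.

    x has shape (m, dim) and y has shape (n, dim), with m, n >= 2.
    Hypothetical helper, not taken from the paper's code.
    """
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    m, n = len(x), len(y)

    k_xx = rbf_kernel(x, x, bandwidth)
    k_yy = rbf_kernel(y, y, bandwidth)
    k_xy = rbf_kernel(x, y, bandwidth)

    # Exclude diagonal (self-similarity) terms for the unbiased estimate.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    term_xy = k_xy.mean()
    return float(term_xx + term_yy - 2.0 * term_xy)


# Placeholder data standing in for real and synthetic speech embeddings.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))
synth = rng.normal(loc=0.5, size=(200, 8))
print(mmd2_unbiased(real, synth, bandwidth=2.0))
```

Unlike the Fréchet distance, this estimate makes no Gaussian assumption about the embeddings, but its value depends on the kernel bandwidth, so scores are only comparable when the same kernel settings are used throughout.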

Problem

Research questions and friction points this paper is trying to address.

synthetic speech quality

objective evaluation

Fréchet Speech Distance

human listening tests

speech embeddings

Innovation

Methods, ideas, or system contributions that make the work stand out.

Fréchet Speech Distance

Speech Quality Evaluation

WavLM embeddings

Objective Metrics

Synthetic Speech Assessment

🔎 Similar Papers

A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection

2024-09-23 · arXiv.org · Citations: 1