SpeechJudge: Towards Human-Level Judgment for Speech Naturalness

📅 2025-11-11
🤖 AI Summary
The scarcity of large-scale human preference data for speech synthesis hinders alignment between model outputs and human perception, particularly on subjective metrics such as naturalness. To address this, we introduce SpeechJudge, a comprehensive suite for speech naturalness judgment. It comprises a large-scale multilingual pairwise preference dataset (SpeechJudge-Data), a challenging benchmark (SpeechJudge-Eval), and a generative reward model (SpeechJudge-GRM) trained via supervised fine-tuning on chain-of-thought rationales followed by GRPO-based reinforcement learning. The preference data are generated by diverse zero-shot TTS models and annotated by human raters for intelligibility and naturalness. The reward model, built on Qwen2.5-Omni-7B, achieves a 79.4% agreement rate with human judgment on SpeechJudge-Eval under inference-time scaling @10, significantly outperforming a classic Bradley-Terry reward model (72.7%), and can further guide post-training of TTS systems to improve naturalness alignment.

📝 Abstract
Aligning large generative models with human feedback is a critical challenge. In speech synthesis, this is particularly pronounced due to the lack of a large-scale human preference dataset, which hinders the development of models that truly align with human perception. To address this, we introduce SpeechJudge, a comprehensive suite comprising a dataset, a benchmark, and a reward model centered on naturalness, one of the most fundamental subjective metrics for speech synthesis. First, we present SpeechJudge-Data, a large-scale human feedback corpus of 99K speech pairs. The dataset is constructed using a diverse set of advanced zero-shot text-to-speech (TTS) models covering multiple speech styles and languages, with human annotations for both intelligibility and naturalness preference. From this, we establish SpeechJudge-Eval, a challenging benchmark for speech naturalness judgment. Our evaluation reveals that existing metrics and AudioLLMs struggle with this task; the leading model, Gemini-2.5-Flash, achieves less than 70% agreement with human judgment, highlighting a significant gap for improvement. To bridge this gap, we develop SpeechJudge-GRM, a generative reward model (GRM) based on Qwen2.5-Omni-7B. It is trained on SpeechJudge-Data via a two-stage post-training process: Supervised Fine-Tuning (SFT) with Chain-of-Thought rationales, followed by Reinforcement Learning (RL) with GRPO on challenging cases. On the SpeechJudge-Eval benchmark, the proposed SpeechJudge-GRM demonstrates superior performance, achieving 77.2% accuracy (79.4% with inference-time scaling @10) compared to a classic Bradley-Terry reward model (72.7%). Furthermore, SpeechJudge-GRM can also be employed as a reward function during the post-training of speech generation models to facilitate their alignment with human preferences.
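For context on the baseline: a Bradley-Terry reward model assigns each utterance a scalar score and models the probability of the annotated preference as a logistic function of the score difference. A minimal sketch of that preference model and its training loss (function names are illustrative, not from the paper):

```python
import math

def bt_preference_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that sample A is preferred over sample B,
    given scalar reward scores for the two samples."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def bt_pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human-annotated preference under the
    Bradley-Terry model; minimized when the chosen sample scores higher."""
    return -math.log(bt_preference_prob(reward_chosen, reward_rejected))
```

Equal scores yield a 50/50 preference probability, and the loss shrinks as the margin between chosen and rejected scores grows. The generative reward model instead emits a textual judgment with a rationale, which is what enables chain-of-thought training and inference-time scaling.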
Problem

Research questions and friction points this paper is trying to address.

- Lack of a large-scale human preference dataset for speech synthesis
- Existing metrics and AudioLLMs struggle to judge speech naturalness accurately
- Need to bridge the performance gap between automatic and human evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Large-scale human feedback dataset for speech naturalness
- Generative reward model trained via SFT and RL
- Superior accuracy in aligning with human judgment
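The inference-time scaling @10 reported above samples multiple verdicts from the generative reward model for the same pair and aggregates them. A simple majority vote is one plausible aggregation rule; treating it as such is an assumption here, not a detail confirmed by this summary:

```python
from collections import Counter

def aggregate_verdicts(judgments: list[str]) -> str:
    """Aggregate k sampled verdicts from a generative reward model
    (e.g. "A" or "B" for a pairwise naturalness comparison) by
    majority vote. Assumed aggregation rule, for illustration only."""
    return Counter(judgments).most_common(1)[0][0]
```

Sampling k rationales and voting trades extra inference compute for accuracy, which matches the reported jump from 77.2% (single pass) to 79.4% (@10).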