SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing speech quality assessment methods rely heavily on scalar scores or binary decisions, which suffer from poor interpretability and limited generalization across tasks and languages. Method: We propose the "SpeechLLM-as-Judges" paradigm, introducing SpeechEval, a large-scale multilingual speech evaluation dataset, and SQ-LLM, a speech-quality-aware large language model for quality assessment. SQ-LLM incorporates chain-of-thought reasoning, reward optimization, and structured prompt learning. Contribution/Results: SQ-LLM is the first model to enable unified, multi-granularity speech evaluation, including fine-grained scoring, pairwise comparison, actionable improvement suggestions, and deepfake detection, within a single framework, and it maintains robust zero-shot cross-lingual transfer. Experiments demonstrate significant gains in assessment transparency, generalizability, and task adaptability. This work establishes a novel, interpretable, and scalable paradigm for speech quality evaluation.
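The summary names reward optimization but not its exact form. One common recipe for score-prediction tasks is to reward the judge when its predicted score lands close to the human rating; the sketch below illustrates that idea. The linear reward shape, the 1-5 scale, and the parsing rule are assumptions for illustration, not SQ-LLM's actual training objective.

```python
# Hypothetical reward for an RL-style fine-tuning step: higher when the
# judge's predicted score is closer to the human rating. The linear shape
# and the parse rule are assumptions, not the paper's reward design.
import re

def score_reward(model_output: str, human_score: float) -> float:
    """Parse the final numeric score from the judge's output and reward
    closeness to the human rating on a 1-5 scale."""
    numbers = re.findall(r"\d+(?:\.\d+)?", model_output)
    if not numbers:
        return -1.0  # penalize unparseable outputs
    predicted = float(numbers[-1])
    # Maximum gap on a 1-5 scale is 4, so the reward lies in [0, 1]
    # whenever the prediction stays on-scale.
    return 1.0 - abs(predicted - human_score) / 4.0

print(score_reward("Reasoning: ... Final score: 3.5", human_score=4.0))  # 0.875
```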

📝 Abstract
Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which lack interpretability and generalization across tasks and languages. We present SpeechLLM-as-Judges, a new paradigm for enabling large language models (LLMs) to conduct structured and explanation-based speech quality evaluation. To support this direction, we introduce SpeechEval, a large-scale dataset containing 32,207 multilingual speech clips and 128,754 annotations spanning four tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. Based on this resource, we develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought reasoning and reward optimization to improve capability. Experimental results show that SQ-LLM delivers strong performance across tasks and languages, revealing the potential of this paradigm for advancing speech quality evaluation. Relevant resources will be open-sourced.
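As a rough illustration of what "structured and explanation-based" evaluation means here, the sketch below shows one plausible judgment record covering the dataset's four task types. The field names and values are assumptions for illustration, not SpeechEval's actual annotation schema.

```python
# Hypothetical structured judgment record; field names are illustrative
# assumptions, not the SpeechEval schema.
import json

judgment = {
    "task": "quality_assessment",  # other tasks: pairwise_comparison,
                                   # improvement_suggestion, deepfake_detection
    "reasoning": (
        "Slight metallic timbre on vowels; prosody is natural, but pauses "
        "at clause boundaries are too short."
    ),
    "score": 3.8,  # fine-grained MOS-style rating
    "suggestions": [
        "lengthen pauses at clause boundaries",
        "reduce high-frequency artifacts",
    ],
}
print(json.dumps(judgment, indent=2))
```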
Problem

Research questions and friction points this paper is trying to address.

Scalar scores and binary decisions make perceptual quality evaluation of synthetic speech hard to interpret
Existing methods generalize poorly across tasks and languages
How to enable large language models to perform structured, explanation-based speech quality evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

SpeechLLM-as-Judges paradigm: LLMs conduct structured, explanation-based speech quality evaluation
SpeechEval: 32,207 multilingual speech clips with 128,754 annotations across four tasks
SQ-LLM trained with chain-of-thought reasoning and reward optimization (see the sketch after this list)
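To make the pairwise-comparison task concrete, here is a minimal Python sketch of chain-of-thought judging. The prompt wording is an assumption, and `query_llm` is a placeholder stub; the paper does not publish SQ-LLM's actual prompts or inference interface.

```python
# Minimal sketch of chain-of-thought pairwise judging. `query_llm` is a
# placeholder for a speech-aware LLM backend; swap in a real model call.
def judge_pair(transcript: str, clip_a_desc: str, clip_b_desc: str) -> str:
    prompt = (
        "You are a speech quality judge. Compare two renditions of the "
        f"same text: '{transcript}'.\n"
        f"Clip A: {clip_a_desc}\nClip B: {clip_b_desc}\n"
        "First reason step by step about naturalness, intelligibility, "
        "and artifacts; then answer with 'A' or 'B' on the final line."
    )
    return query_llm(prompt)

def query_llm(prompt: str) -> str:
    # Stub so the sketch runs standalone.
    return "Reasoning: ...\nB"

print(judge_pair("Hello world.", "slight buzzing", "clean but flat prosody"))
```

In the paper's framing the clips themselves would be fed to a speech-capable model rather than described in text; the textual descriptions here only keep the sketch self-contained.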
Hui Wang
Nankai University
Jinghua Zhao
Nankai University
Yifan Yang
Microsoft Corporation
Shujie Liu
Microsoft Corporation
Junyang Chen
Nankai University
Yanzhe Zhang
Nankai University
Shiwan Zhao
Independent Researcher; formerly Research Scientist, IBM Research - China (2000-2020)
AGI · Large Language Model · NLP · Speech · Recommender System
Jinyu Li
Partner Applied Science Manager, Microsoft
Acoustic Modeling · Speech Recognition · Speech Translation
Jiaming Zhou
Nankai University
Haoqin Sun
Nankai University
Affective computing · Speech signal processing · Audio understanding
Yan Lu
Microsoft Corporation
Yong Qin
Nankai University