🤖 AI Summary
Problem: Existing speech quality assessment methods rely heavily on scalar scores or binary decisions, which limits interpretability and generalization across tasks and languages.
Method: We propose the "SpeechLLM-as-Judges" paradigm, introducing SpeechEval, a large-scale multilingual speech evaluation dataset, and SQ-LLM, a speech-quality-aware large language model trained with chain-of-thought reasoning, reward optimization, and structured prompt learning.
Contribution/Results: SQ-LLM unifies multi-granularity speech evaluation within a single framework, covering fine-grained scoring, pairwise comparison, actionable improvement suggestions, and deepfake detection, and shows robust zero-shot cross-lingual transfer. Experiments demonstrate significant gains in assessment transparency, generalization, and task adaptability, establishing an interpretable and scalable paradigm for speech quality evaluation.
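As a rough illustration of what a structured, explanation-based verdict could look like under this paradigm, here is a minimal Python sketch. The prompt wording, JSON schema, and all field names (`reasoning`, `quality_score`, `suggestions`, `is_deepfake`) are our own assumptions for exposition, not the paper's released format.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical structured verdict the judge LLM is asked to emit.
# Field names are illustrative; the paper's actual schema may differ.
@dataclass
class SpeechVerdict:
    reasoning: str          # chain-of-thought explanation
    quality_score: float    # fine-grained score, e.g. on a 1-5 MOS-like scale
    suggestions: List[str]  # actionable improvement suggestions
    is_deepfake: bool       # authenticity decision

JUDGE_PROMPT = (
    "You are a speech quality judge. Given a description of a speech clip, "
    "reason step by step, then reply in JSON with the keys "
    '"reasoning", "quality_score" (1-5), "suggestions", and "is_deepfake".'
)

def parse_verdict(llm_output: str) -> SpeechVerdict:
    """Parse the judge model's JSON reply into a typed verdict."""
    data = json.loads(llm_output)
    return SpeechVerdict(
        reasoning=data["reasoning"],
        quality_score=float(data["quality_score"]),
        suggestions=list(data["suggestions"]),
        is_deepfake=bool(data["is_deepfake"]),
    )

# Example reply such a judge might produce:
raw = """{"reasoning": "Metallic artifacts on sibilants; prosody is otherwise natural.",
"quality_score": 3.5,
"suggestions": ["Attenuate high-frequency artifacts", "Smooth phrase-final pitch"],
"is_deepfake": false}"""
print(parse_verdict(raw))
```

Pairing each decision with a free-text rationale is what distinguishes this output contract from a score-only regressor, and it is what makes the evaluation auditable by humans.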
📝 Abstract
Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which lack interpretability and generalization across tasks and languages. We present SpeechLLM-as-Judges, a new paradigm for enabling large language models (LLMs) to conduct structured and explanation-based speech quality evaluation. To support this direction, we introduce SpeechEval, a large-scale dataset containing 32,207 multilingual speech clips and 128,754 annotations spanning four tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. Based on this resource, we develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought reasoning and reward optimization to improve its evaluation capability. Experimental results show that SQ-LLM delivers strong performance across tasks and languages, revealing the potential of this paradigm for advancing speech quality evaluation. Relevant resources will be open-sourced.
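To make the dataset's four-task structure concrete, the following sketch shows what a single SpeechEval annotation record might look like. All field names and values here are hypothetical placeholders, since the released schema is not reproduced in this summary.

```python
# Hypothetical SpeechEval-style record; keys and values are illustrative
# assumptions only, mirroring the four tasks named in the abstract.
record = {
    "clip_id": "clip_000001",
    "language": "en",
    "tasks": {
        "quality_assessment": {
            "score": 3.5,
            "rationale": "Mild buzzing under sustained vowels.",
        },
        "pairwise_comparison": {
            "against": "clip_000002",
            "preferred": "this",
            "rationale": "Fewer artifacts, steadier prosody.",
        },
        "improvement_suggestion": [
            "Attenuate high-frequency noise",
            "Shorten unnatural pauses",
        ],
        "deepfake_detection": {"label": "synthetic", "confidence": 0.92},
    },
}

# Each task pairs a decision with a textual rationale, which is what
# supports explanation-based (rather than score-only) judge training.
for task, annotation in record["tasks"].items():
    print(task, "->", annotation)
```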