SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing speech quality assessment methods rely heavily on scalar scores or binary decisions, which suffer from poor interpretability and limited generalization across tasks and languages. Method: We propose the "SpeechLLM-as-Judges" paradigm, introducing SpeechEval, a large-scale multilingual speech evaluation dataset, and SQ-LLM, a speech-quality-aware large language model for quality assessment. SQ-LLM incorporates chain-of-thought reasoning, reward optimization, and structured prompt learning. Contribution/Results: SQ-LLM is the first model to enable unified, multi-granularity speech evaluation, including fine-grained scoring, pairwise comparison, actionable improvement suggestions, and deepfake detection, within a single framework, and it maintains robust zero-shot cross-lingual transfer. Experiments demonstrate significant gains in assessment transparency, generalizability, and task adaptability. This work establishes a novel, interpretable, and scalable paradigm for speech quality evaluation.
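The summary names reward optimization but not its exact form. One common recipe for score-prediction tasks is to reward the judge when its predicted score lands close to the human rating; the sketch below illustrates that idea. The linear reward shape, the 1-5 scale, and the parsing rule are assumptions for illustration, not SQ-LLM's actual training objective.

```python
# Hypothetical reward for an RL-style fine-tuning step: higher when the
# judge's predicted score is closer to the human rating. The linear shape
# and the parse rule are assumptions, not the paper's reward design.
import re

def score_reward(model_output: str, human_score: float) -> float:
    """Parse the final numeric score from the judge's output and reward
    closeness to the human rating on a 1-5 scale."""
    numbers = re.findall(r"\d+(?:\.\d+)?", model_output)
    if not numbers:
        return -1.0  # penalize unparseable outputs
    predicted = float(numbers[-1])
    # Maximum gap on a 1-5 scale is 4, so the reward lies in [0, 1]
    # whenever the prediction stays on-scale.
    return 1.0 - abs(predicted - human_score) / 4.0

print(score_reward("Reasoning: ... Final score: 3.5", human_score=4.0))  # 0.875
```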

📝 Abstract
Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which lack interpretability and generalization across tasks and languages. We present SpeechLLM-as-Judges, a new paradigm for enabling large language models (LLMs) to conduct structured and explanation-based speech quality evaluation. To support this direction, we introduce SpeechEval, a large-scale dataset containing 32,207 multilingual speech clips and 128,754 annotations spanning four tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. Based on this resource, we develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought reasoning and reward optimization to improve capability. Experimental results show that SQ-LLM delivers strong performance across tasks and languages, revealing the potential of this paradigm for advancing speech quality evaluation. Relevant resources will be open-sourced.
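As a rough illustration of what "structured and explanation-based" evaluation means here, the sketch below shows one plausible judgment record covering the dataset's four task types. The field names and values are assumptions for illustration, not SpeechEval's actual annotation schema.

```python
# Hypothetical structured judgment record; field names are illustrative
# assumptions, not the SpeechEval schema.
import json

judgment = {
    "task": "quality_assessment",  # other tasks: pairwise_comparison,
                                   # improvement_suggestion, deepfake_detection
    "reasoning": (
        "Slight metallic timbre on vowels; prosody is natural, but pauses "
        "at clause boundaries are too short."
    ),
    "score": 3.8,  # fine-grained MOS-style rating
    "suggestions": [
        "lengthen pauses at clause boundaries",
        "reduce high-frequency artifacts",
    ],
}
print(json.dumps(judgment, indent=2))
```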
Problem

Research questions and friction points this paper is trying to address.

Scalar scores and binary decisions make perceptual quality evaluation of synthetic speech hard to interpret
Existing methods generalize poorly across tasks and languages
How to enable large language models to perform structured, explanation-based speech quality evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

SpeechLLM-as-Judges paradigm: LLMs conduct structured, explanation-based speech quality evaluation
SpeechEval: 32,207 multilingual speech clips with 128,754 annotations across four tasks
SQ-LLM trained with chain-of-thought reasoning and reward optimization (see the sketch after this list)
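To make the pairwise-comparison task concrete, here is a minimal Python sketch of chain-of-thought judging. The prompt wording is an assumption, and `query_llm` is a placeholder stub; the paper does not publish SQ-LLM's actual prompts or inference interface.

```python
# Minimal sketch of chain-of-thought pairwise judging. `query_llm` is a
# placeholder for a speech-aware LLM backend; swap in a real model call.
def judge_pair(transcript: str, clip_a_desc: str, clip_b_desc: str) -> str:
    prompt = (
        "You are a speech quality judge. Compare two renditions of the "
        f"same text: '{transcript}'.\n"
        f"Clip A: {clip_a_desc}\nClip B: {clip_b_desc}\n"
        "First reason step by step about naturalness, intelligibility, "
        "and artifacts; then answer with 'A' or 'B' on the final line."
    )
    return query_llm(prompt)

def query_llm(prompt: str) -> str:
    # Stub so the sketch runs standalone.
    return "Reasoning: ...\nB"

print(judge_pair("Hello world.", "slight buzzing", "clean but flat prosody"))
```

In the paper's framing the clips themselves would be fed to a speech-capable model rather than described in text; the textual descriptions here only keep the sketch self-contained.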
Hui Wang
Nankai University
Jinghua Zhao
Nankai University
Yifan Yang
Microsoft Corporation
Shujie Liu
Microsoft Corporation
Junyang Chen
Nankai University
Yanzhe Zhang
Nankai University
Shiwan Zhao
Independent Researcher; formerly Research Scientist, IBM Research - China (2000-2020)
AGI · Large Language Model · NLP · Speech · Recommender System
Jinyu Li
Partner Applied Science Manager, Microsoft
Acoustic Modeling · Speech Recognition · Speech Translation
Jiaming Zhou
Nankai University
Haoqin Sun
Nankai University
Affective computing · Speech signal processing · Audio understanding
Yan Lu
Microsoft Corporation
Yong Qin
Nankai University