🤖 AI Summary
Existing speech quality assessment datasets lack fine-grained natural language annotations, hindering explainable, multi-dimensional auditory understanding and generation. To address this, we introduce the first natural language description dataset targeting low-level speech distortions, covering 11 canonical noise and distortion types, with each sample accompanied by structured textual annotations that incorporate causal reasoning. We pioneer the integration of explainable natural language inference into speech quality evaluation, establishing a benchmark framework for assessing the fine-grained comprehension and generation capabilities of auditory large language models (auditory LLMs). Our approach jointly leverages speech signal analysis, multi-dimensional semantic annotation, and structured prompt engineering to fine-tune auditory LLMs. Experiments demonstrate that the fine-tuned model accurately identifies distortion types and temporal characteristics while generating high-quality, causally grounded descriptions, substantially improving the interpretability and reliability of speech quality assessment.
📝 Abstract
This paper explores a novel perspective on speech quality assessment by leveraging natural language descriptions, offering richer, more nuanced insights than traditional numerical scoring methods. Natural language feedback provides instructive recommendations and detailed evaluations, yet existing datasets lack the comprehensive annotations needed for this approach. To bridge this gap, we introduce QualiSpeech, a comprehensive low-level speech quality assessment dataset encompassing 11 key aspects and detailed natural language comments that include reasoning and contextual insights. Additionally, we propose the QualiSpeech Benchmark to evaluate the low-level speech understanding capabilities of auditory large language models (LLMs). Experimental results demonstrate that fine-tuned auditory LLMs can reliably generate detailed descriptions of noise and distortion, effectively identifying their types and temporal characteristics. The results further highlight the potential of incorporating reasoning to enhance the accuracy and reliability of quality assessments. The dataset will be released at https://huggingface.co/datasets/tsinghua-ee/QualiSpeech.