A Study on Zero-shot Non-intrusive Speech Assessment using Large Language Models

📅 2024-09-16
🏛️ arXiv.org

🤖 AI Summary
This study addresses non-intrusive assessment of speech quality and ASR accuracy when no reference speech samples are available. We propose GPT-Whisper, a zero-shot end-to-end framework that takes Whisper-generated transcriptions as input and uses directed prompt engineering to elicit naturalness, intelligibility, and Character Error Rate (CER) predictions from GPT-4o, requiring no training data or audio modeling. Our key contribution is the shift from audio-dependent evaluation to fully text-based, controllable semantic speech assessment, a first in the field. Experiments demonstrate that GPT-Whisper outperforms the supervised models MOS-SSL and MTI-Net in CER prediction, surpasses SpeechLMScore and DNSMOS in intelligibility estimation, and achieves moderate yet statistically significant correlation with human ratings (Spearman ρ ≈ 0.5–0.6).
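
For readers unfamiliar with the CER target mentioned above, the following is a minimal sketch of the conventional definition: character-level edit distance divided by the reference length. It is not code from the paper; the function names `levenshtein` and `cer` are illustrative, and the paper's exact text normalization (casing, punctuation, whitespace) is not specified here.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and
    substitutions needed to turn `ref` into `hyp`."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length.
    Can exceed 1.0 when the hypothesis is much longer than the reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

In this setting the ground-truth CER would be computed between a reference transcript and the Whisper output, while GPT-Whisper predicts that value without seeing the reference.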

๐Ÿ“ Abstract
This work investigates two strategies for zero-shot non-intrusive speech assessment leveraging large language models. First, we explore the audio analysis capabilities of GPT-4o. Second, we propose GPT-Whisper, which uses Whisper as an audio-to-text module and evaluates the naturalness of text via targeted prompt engineering. We evaluate the assessment metrics predicted by GPT-4o and GPT-Whisper, examining their correlation with human-based quality and intelligibility assessments and the character error rate (CER) of automatic speech recognition. Experimental results show that GPT-4o alone is less effective for audio analysis, while GPT-Whisper achieves higher prediction accuracy, has moderate correlation with speech quality and intelligibility, and has higher correlation with CER. Compared to SpeechLMScore and DNSMOS, GPT-Whisper excels in intelligibility metrics, but performs slightly worse than SpeechLMScore in quality estimation. Furthermore, GPT-Whisper outperforms supervised non-intrusive models MOS-SSL and MTI-Net in Spearman's rank correlation for CER of Whisper. These findings validate GPT-Whisper's potential for zero-shot speech assessment without requiring additional training data.
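
The comparisons in the abstract rest on Spearman's rank correlation between predicted scores and reference values. As a rough illustration of how that statistic is obtained, the sketch below ranks both lists (averaging ranks for ties) and takes the Pearson correlation of the ranks. It is a generic textbook formulation, not the authors' evaluation script, and the helper names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))
```

A typical use would pass per-utterance predicted scores as `x` and the matching human ratings or measured CER values as `y`. Because only ranks matter, a predictor can score well even if its outputs are on a different scale from the reference.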
Problem

Research questions and friction points this paper is trying to address.

Speech Quality Evaluation
Speech Recognition Accuracy
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPT-Whisper
Speech Quality Evaluation
Unsupervised Assessment