Are These Even Words? Quantifying the Gibberishness of Generative Speech Models

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenging problem of “semantic hallucinations” in generative speech models—high-fidelity synthetic speech exhibiting high phoneme accuracy yet conveying semantically nonsensical content (e.g., grammatical errors or logical contradictions). To overcome the limitations of existing non-intrusive evaluation metrics in modeling semantic plausibility, we propose the first unsupervised, large language model (LLM)-based framework for assessing speech semantic coherence and logical consistency: ASR transcripts are jointly scored by multiple LLMs to quantify semantic validity. Our contributions are threefold: (1) We construct and publicly release NonsenseSpeech-1K, the first high-quality dataset of synthetic nonsensical speech; (2) We systematically demonstrate the severe inadequacy of mainstream non-intrusive metrics—including MOS, WER, and BERTScore—in detecting semantic hallucinations; (3) We empirically validate that our LLM-based evaluator achieves high sensitivity to semantic anomalies under reference-free conditions, establishing a novel paradigm for trustworthy speech generation evaluation.
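The core idea above is reference-free scoring: transcribe the speech, then ask a language model how likely the transcript is. As a minimal runnable sketch of that scoring step, the snippet below computes the mean per-token log-likelihood of a sentence. It substitutes a tiny add-one-smoothed bigram model trained on a toy corpus for the pretrained LLMs the paper uses (a pure assumption made only to keep the example self-contained); a gibberish word order yields unseen bigrams and thus a lower score.

```python
import math
from collections import Counter

# Toy corpus standing in for an LM's training data. Assumption: the paper
# scores transcripts with pretrained LLMs; a bigram model is used here only
# so the sketch stays self-contained and runnable.
CORPUS = [
    "the cat sat on the mat",
    "the dog slept on the rug",
    "a bird sang in the tree",
]

def train_bigram(sentences):
    """Count unigrams and bigrams over sentences padded with <s>/</s>."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(toks[:-1])
        bigrams.update(zip(toks[:-1], toks[1:]))
    vocab = {t for s in sentences for t in s.split()} | {"<s>", "</s>"}
    return unigrams, bigrams, len(vocab)

def mean_log_likelihood(sentence, unigrams, bigrams, vocab_size):
    """Average log P(w_i | w_{i-1}) with add-one (Laplace) smoothing.
    Higher values mean the word sequence is more plausible to the model."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    logp = 0.0
    for prev, cur in zip(toks[:-1], toks[1:]):
        logp += math.log(
            (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        )
    return logp / (len(toks) - 1)
```

With this scorer, "the cat slept on the mat" receives a higher mean log-likelihood than the shuffled "mat on the cat slept the", which is the separation a gibberish detector needs; the paper's framework plays the same game with far stronger LLMs on ASR transcripts.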

📝 Abstract
Significant research effort is currently being dedicated to non-intrusive quality and intelligibility assessment, especially since it enables the curation of large-scale datasets of in-the-wild speech data. However, with the increasing ability of generative models to synthesize high-quality speech, new types of artifacts become relevant, such as generative hallucinations. While intrusive metrics can detect such discrepancies from a reference signal, it is not clear how current non-intrusive methods react to high-quality phoneme confusions or, more extremely, gibberish speech. In this paper we explore how to factor in this aspect under a fully unsupervised setting by leveraging language models. Additionally, we publish a dataset of high-quality synthesized gibberish speech for further development of measures to assess implausible sentences in spoken language, alongside code for calculating scores from a variety of speech language models.
Problem

Research questions and friction points this paper is trying to address.

Quantifying gibberish speech artifacts in generative models
Assessing non-intrusive methods for phoneme confusion detection
Developing unsupervised metrics for implausible synthesized speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging language models for unsupervised gibberish detection
Creating dataset of synthesized gibberish speech artifacts
Developing metrics for assessing implausible spoken sentences
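The summary notes that transcripts are "jointly scored by multiple LLMs", but this listing does not specify how the scores are combined. One natural combiner, sketched below under that assumption, z-normalizes each model's scores over a batch before averaging, so no single model's scale dominates; the scorer callables here are hypothetical stand-ins for real LLM scorers.

```python
import statistics

def joint_semantic_score(transcripts, scorers):
    """Combine several plausibility scorers into one score per transcript.

    `scorers` is a list of callables transcript -> float (higher = more
    plausible). Each model's raw scores are z-normalized across the batch,
    then averaged per transcript. Assumption: the paper's actual
    aggregation rule is not given in this listing; mean-of-z-scores is
    one simple choice.
    """
    per_model = []
    for score in scorers:
        raw = [score(t) for t in transcripts]
        mu = statistics.mean(raw)
        sigma = statistics.pstdev(raw) or 1.0  # guard against zero spread
        per_model.append([(r - mu) / sigma for r in raw])
    # Transpose: one combined score per transcript.
    return [statistics.mean(col) for col in zip(*per_model)]
```

Because each model is normalized over the same batch, a transcript that every scorer ranks low (a likely semantic hallucination) ends up with a clearly negative combined score, regardless of the models' differing raw scales.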