🤖 AI Summary
This study addresses the challenging problem of “semantic hallucinations” in generative speech models: high-fidelity synthetic speech with high phoneme accuracy that nonetheless conveys semantically nonsensical content (e.g., grammatical errors or logical contradictions). To overcome the inability of existing non-intrusive evaluation metrics to model semantic plausibility, we propose the first unsupervised, large language model (LLM)-based framework for assessing the semantic coherence and logical consistency of speech: ASR transcripts are jointly scored by multiple LLMs to quantify semantic validity. Our contributions are threefold: (1) we construct and publicly release NonsenseSpeech-1K, the first high-quality dataset of synthetic nonsensical speech; (2) we systematically demonstrate that mainstream non-intrusive metrics—including MOS, WER, and BERTScore—are severely inadequate at detecting semantic hallucinations; (3) we empirically validate that our LLM-based evaluator is highly sensitive to semantic anomalies under reference-free conditions, establishing a novel paradigm for trustworthy speech generation evaluation.
📝 Abstract
Significant research effort is currently being dedicated to non-intrusive quality and intelligibility assessment, especially because it enables the curation of large-scale datasets of in-the-wild speech. However, as generative models become increasingly capable of synthesizing high-quality speech, new types of artifacts become relevant, such as generative hallucinations. While intrusive metrics can spot such discrepancies from a reference signal, it is not clear how current non-intrusive methods react to high-quality phoneme confusions or, more extremely, to gibberish speech. In this paper we explore how to factor in this aspect in a fully unsupervised setting by leveraging language models. Additionally, we publish a dataset of high-quality synthesized gibberish speech for the further development of measures that assess implausible sentences in spoken language, alongside code for computing scores from a variety of speech language models.