🤖 AI Summary
This study addresses the challenging problem of “semantic hallucinations” in generative speech models: high-fidelity synthetic speech with high phoneme accuracy that nonetheless conveys semantically nonsensical content (e.g., grammatical errors or logical contradictions). To overcome the inability of existing non-intrusive evaluation metrics to model semantic plausibility, we propose the first unsupervised, large language model (LLM)-based framework for assessing the semantic coherence and logical consistency of speech: ASR transcripts are jointly scored by multiple LLMs to quantify semantic validity. Our contributions are threefold: (1) we construct and publicly release NonsenseSpeech-1K, the first high-quality dataset of synthetic nonsensical speech; (2) we systematically demonstrate that mainstream non-intrusive metrics—including MOS, WER, and BERTScore—are severely inadequate at detecting semantic hallucinations; (3) we empirically validate that our LLM-based evaluator is highly sensitive to semantic anomalies under reference-free conditions, establishing a novel paradigm for trustworthy speech generation evaluation.
📝 Abstract
Significant research effort is currently being dedicated to non-intrusive quality and intelligibility assessment, especially because it enables the curation of large-scale datasets of in-the-wild speech. However, as generative models become increasingly capable of synthesizing high-quality speech, new types of artifacts become relevant, such as generative hallucinations. While intrusive metrics can spot such discrepancies from a reference signal, it is not clear how current non-intrusive methods react to high-quality phoneme confusions or, more extremely, to gibberish speech. In this paper we explore how to factor in this aspect in a fully unsupervised setting by leveraging language models. Additionally, we publish a dataset of high-quality synthesized gibberish speech for the further development of measures that assess implausible sentences in spoken language, alongside code for computing scores from a variety of speech language models.