🤖 AI Summary
This work addresses semantic-level hallucination errors, such as phoneme omissions and speaker inconsistency, that generative speech enhancement models introduce during denoising and that conventional non-intrusive speech quality metrics capture poorly. To this end, the paper proposes a non-intrusive confidence estimation method based on the log-probabilities of the discrete tokens the model generates. This approach leverages discrete token modeling to detect hallucination artifacts that evade traditional evaluation techniques. The resulting confidence scores correlate strongly with intrusive enhancement metrics and effectively identify low-quality samples for removal. When applied to in-the-wild TTS datasets, the method improves data cleaning quality and yields notable performance gains in downstream text-to-speech tasks.
📝 Abstract
Generative speech enhancement (GSE) models show great promise in producing high-quality clean speech from noisy inputs, enabling applications such as curating noisy text-to-speech (TTS) datasets into high-quality ones. However, GSE models are prone to hallucination errors, such as phoneme omissions and speaker inconsistency, which conventional error filtering based on non-intrusive speech quality metrics often fails to detect. To address this issue, we propose a non-intrusive method for filtering hallucination errors from discrete token-based GSE models. Our method leverages the log-probabilities of generated tokens as confidence scores to detect potential errors. Experimental results show that the confidence scores strongly correlate with a suite of intrusive speech enhancement metrics, and that our method effectively identifies hallucination errors missed by conventional filtering methods. Furthermore, we demonstrate the practical utility of our method: curating an in-the-wild TTS dataset with our confidence-based filtering improves the performance of subsequently trained TTS models.
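To make the core idea concrete, below is a minimal sketch of confidence-based filtering under the assumptions implied by the abstract: the GSE model emits discrete tokens and exposes its per-step logits, so the mean log-probability of the generated tokens can serve as a confidence score and low-scoring samples are dropped. The function names, tensor shapes, and threshold value here are illustrative placeholders, not the paper's actual implementation.

```python
# Sketch of token-log-probability confidence filtering for a
# discrete token-based GSE model. Assumes per-step logits are available;
# API names and the threshold are hypothetical, not from the paper.
import torch
import torch.nn.functional as F

def token_confidence(logits: torch.Tensor, tokens: torch.Tensor) -> float:
    """Mean log-probability of the tokens the model actually generated.

    logits: (T, V) raw scores over the token vocabulary at each step.
    tokens: (T,)  ids of the emitted tokens.
    """
    log_probs = F.log_softmax(logits, dim=-1)              # (T, V)
    chosen = log_probs.gather(1, tokens.unsqueeze(1))      # (T, 1)
    return chosen.mean().item()

def filter_enhanced_samples(samples, threshold=-1.5):
    """Keep enhanced utterances whose confidence clears the threshold.

    `samples` is an iterable of (logits, tokens, audio) triples produced
    by the GSE model; the threshold value is illustrative only.
    """
    return [
        audio
        for logits, tokens, audio in samples
        if token_confidence(logits, tokens) >= threshold
    ]
```

In practice the threshold would be tuned on held-out data, e.g. against the intrusive speech enhancement metrics the confidence scores are reported to correlate with.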