🤖 AI Summary
This work addresses semantic-level hallucination errors, such as phoneme omissions and speaker inconsistency, that generative speech enhancement models introduce during denoising and that conventional non-intrusive speech quality metrics capture poorly. To this end, the paper proposes a non-intrusive confidence estimation method based on the log-probabilities of the discrete tokens the model generates. This approach leverages discrete token modeling to detect hallucination artifacts that evade traditional evaluation techniques. The resulting confidence scores correlate strongly with intrusive enhancement metrics and effectively identify low-quality samples for removal. When applied to in-the-wild TTS datasets, the method improves data cleaning quality and yields notable performance gains in downstream text-to-speech tasks.
📝 Abstract
Generative speech enhancement (GSE) models show great promise in producing high-quality clean speech from noisy inputs, enabling applications such as curating noisy text-to-speech (TTS) datasets into high-quality ones. However, GSE models are prone to hallucination errors, such as phoneme omissions and speaker inconsistency, which conventional error filtering based on non-intrusive speech quality metrics often fails to detect. To address this issue, we propose a non-intrusive method for filtering hallucination errors from discrete token-based GSE models. Our method leverages the log-probabilities of generated tokens as confidence scores to detect potential errors. Experimental results show that the confidence scores strongly correlate with a suite of intrusive speech enhancement metrics, and that our method effectively identifies hallucination errors missed by conventional filtering methods. Furthermore, we demonstrate the practical utility of our method: curating an in-the-wild TTS dataset with our confidence-based filtering improves the performance of subsequently trained TTS models.
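To make the core idea concrete, below is a minimal sketch of confidence-based filtering under the assumptions implied by the abstract: the GSE model emits discrete tokens and exposes its per-step logits, so the mean log-probability of the generated tokens can serve as a confidence score and low-scoring samples are dropped. The function names, tensor shapes, and threshold value here are illustrative placeholders, not the paper's actual implementation.

```python
# Sketch of token-log-probability confidence filtering for a
# discrete token-based GSE model. Assumes per-step logits are available;
# API names and the threshold are hypothetical, not from the paper.
import torch
import torch.nn.functional as F

def token_confidence(logits: torch.Tensor, tokens: torch.Tensor) -> float:
    """Mean log-probability of the tokens the model actually generated.

    logits: (T, V) raw scores over the token vocabulary at each step.
    tokens: (T,)  ids of the emitted tokens.
    """
    log_probs = F.log_softmax(logits, dim=-1)              # (T, V)
    chosen = log_probs.gather(1, tokens.unsqueeze(1))      # (T, 1)
    return chosen.mean().item()

def filter_enhanced_samples(samples, threshold=-1.5):
    """Keep enhanced utterances whose confidence clears the threshold.

    `samples` is an iterable of (logits, tokens, audio) triples produced
    by the GSE model; the threshold value is illustrative only.
    """
    return [
        audio
        for logits, tokens, audio in samples
        if token_confidence(logits, tokens) >= threshold
    ]
```

In practice the threshold would be tuned on held-out data, e.g. against the intrusive speech enhancement metrics the confidence scores are reported to correlate with.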