Rethinking Training Targets, Architectures and Data Quality for Universal Speech Enhancement

📅 2026-03-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three open problems in universal speech enhancement (USE): the choice of training target, the distortion-perception tradeoff, and the balance between training data quality and scale. The authors propose time-shifted anechoic clean speech as the optimization target, replacing the conventional early-reflected speech target, and introduce a two-stage enhancement framework grounded in distortion-perception tradeoff theory. They also systematically evaluate how data quality affects model performance and find that training on large uncurated corpora imposes a performance ceiling. The proposed approach achieves state-of-the-art results on the URGENT 2025 non-blind test set and shows strong language-agnostic generalization, which makes it effective for improving text-to-speech (TTS) training data.
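
For reference, the distortion-perception tradeoff mentioned above is commonly formalized as a constrained optimization problem. The block below gives that standard formulation as background; the symbols are assumptions chosen for illustration and are not necessarily the notation used by the authors. Here $X$ is the clean speech, $Y$ the degraded observation, $\hat{X}$ the enhanced estimate, $\Delta$ a distortion measure, and $d$ a divergence between distributions.

```latex
D(P) \;=\; \min_{p_{\hat{X}\mid Y}} \ \mathbb{E}\big[\Delta(X, \hat{X})\big]
\quad \text{subject to} \quad d\big(p_X,\, p_{\hat{X}}\big) \le P
```

Read this way, "minimal distortion under a given level of perceptual quality" corresponds to evaluating $D(P)$ at a fixed perceptual budget $P$.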

📝 Abstract

Universal Speech Enhancement (USE) aims to restore speech quality under diverse degradation conditions while preserving signal fidelity. Despite recent progress, key challenges in training target selection, the distortion-perception tradeoff, and data curation remain unresolved. In this work, we systematically address these three overlooked problems. First, we revisit the conventional practice of using early-reflected speech as the dereverberation target and show that it can degrade perceptual quality and downstream ASR performance. We instead demonstrate that time-shifted anechoic clean speech provides a superior learning target. Second, guided by the distortion-perception tradeoff theory, we propose a simple two-stage framework that achieves minimal distortion under a given level of perceptual quality. Third, we analyze the tradeoff between training data scale and quality for USE, revealing that training on large uncurated corpora imposes a performance ceiling, as models struggle to remove subtle artifacts. Our method achieves state-of-the-art performance on the URGENT 2025 non-blind test set and exhibits strong language-agnostic generalization, making it effective for improving TTS training data. Code and models will be released upon acceptance.
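
As a rough illustration of what a time-shifted anechoic target could look like, the sketch below builds one from a clean signal and a room impulse response (RIR). It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify how the shift is computed, so the code assumes the shift equals the direct-path delay, estimated as the index of the largest absolute RIR tap. The function names `time_shifted_anechoic_target` and `reverberant_input` are hypothetical.

```python
import numpy as np


def time_shifted_anechoic_target(clean, rir):
    """Delay the clean signal so it lines up with the direct path of the RIR.

    Assumption (not from the paper): the direct-path delay is taken as the
    index of the largest-magnitude RIR tap. No amplitude scaling is applied.
    """
    clean = np.asarray(clean, dtype=float)
    rir = np.asarray(rir, dtype=float)
    delay = int(np.argmax(np.abs(rir)))

    # Shift right by `delay` samples, zero-padding the start and
    # keeping the original length.
    target = np.zeros_like(clean)
    if delay < len(clean):
        target[delay:] = clean[: len(clean) - delay]
    return target, delay


def reverberant_input(clean, rir):
    """Simulate the reverberant model input, truncated to the clean length."""
    clean = np.asarray(clean, dtype=float)
    rir = np.asarray(rir, dtype=float)
    return np.convolve(clean, rir)[: len(clean)]


# Toy example: a 5-sample "utterance" and an RIR whose direct path
# arrives after 2 samples, followed by two decaying reflections.
clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
rir = np.array([0.0, 0.0, 1.0, 0.5, 0.25])

noisy = reverberant_input(clean, rir)
target, delay = time_shifted_anechoic_target(clean, rir)
```

The point of the shift is that the target stays time-aligned with the direct-path component of the reverberant input while containing no reflections at all, in contrast to an early-reflected target that keeps the first part of the RIR.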

Problem

Research questions and friction points this paper is trying to address.

universal speech enhancement

training target

distortion-perception tradeoff

data quality

speech restoration

Innovation

Methods, ideas, or system contributions that make the work stand out.

universal speech enhancement

distortion-perception tradeoff

training target redesign