🤖 AI Summary
To address the limited generalizability of task-specific speech enhancement models, this paper proposes a universal single-channel speech enhancement front-end that requires no fine-tuning for downstream tasks. Instead of per-task tuning, it defines a consistency loss in self-supervised representation spaces, such as those of Wav2Vec 2.0 or HuBERT, and optimizes the enhancement network end-to-end. The key innovation is shifting the reconstruction objective from conventional time-frequency domains to semantically rich representation spaces, thereby preserving high-level linguistic content alongside perceptual quality. Experiments demonstrate substantial performance gains across diverse downstream tasks, including automatic speech recognition (ASR), speaker identification, and emotion classification, while achieving higher Mean Opinion Scores (MOS) than conventional time-frequency-based methods. These results validate the framework's strong generalization capability and practical utility.
📝 Abstract
Single-channel speech enhancement is used in various tasks to mitigate the effect of interfering signals. Conventionally, a speech enhancement model has had to be tuned for each task to perform optimally, which makes generalizing speech enhancement models to unknown downstream tasks challenging. This study aims to construct a generic speech enhancement front-end that improves the performance of back-ends across multiple downstream tasks. To this end, we propose a novel training criterion that minimizes the distance between the enhanced signal and the ground-truth clean signal in the feature representation domain of self-supervised learning models. Since self-supervised feature representations effectively capture high-level speech information useful for solving various downstream tasks, the proposed criterion encourages enhancement models to preserve such information. Experimental validation demonstrates that the proposal improves the performance of multiple speech tasks while maintaining the perceptual quality of the enhanced signal.
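The training criterion described above can be sketched in a few lines: extract features from both the enhanced and the clean signal with a frozen encoder, then take the mean squared distance between them. The sketch below is a minimal NumPy illustration, not the paper's implementation; a fixed random frame-wise projection stands in for the pretrained self-supervised encoder (in practice the hidden states of a frozen Wav2Vec 2.0 or HuBERT model would be used, with gradients flowing through it into the enhancement network). The frame size and feature dimension are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for a frozen SSL encoder (e.g., Wav2Vec 2.0 / HuBERT):
# a fixed random projection from 160-sample frames to 768-dim features.
# This is an assumption for illustration only.
W = rng.standard_normal((768, 160))

def ssl_features(wave: np.ndarray, frame: int = 160) -> np.ndarray:
    """Frame the waveform and map each frame to a feature vector."""
    n = len(wave) // frame
    frames = wave[: n * frame].reshape(n, frame)
    return frames @ W.T  # shape: (n_frames, 768)

def consistency_loss(enhanced: np.ndarray, clean: np.ndarray) -> float:
    """Mean squared distance between the SSL-domain features of the
    enhanced signal and the ground-truth clean signal."""
    f_enh = ssl_features(enhanced)
    f_cln = ssl_features(clean)
    return float(np.mean((f_enh - f_cln) ** 2))

# Toy signals: the loss is zero for a perfect enhancement and
# positive when residual noise remains.
clean = rng.standard_normal(1600)
noisy = clean + 0.1 * rng.standard_normal(1600)
```

During training, this loss would be minimized with respect to the enhancement network's parameters only, keeping the feature extractor fixed so that the representation space, and the high-level information it encodes, stays stable.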