🤖 AI Summary
Real-world speech is often degraded by multiple concurrent distortions—such as noise, reverberation, compression artifacts, and phase distortions—whereas most existing speech enhancement (SE) methods target only single distortions, limiting their generalizability. To address this, we propose Universal Discrete-domain Speech Enhancement (UDSE), the first SE framework formulated as a discrete token classification task. UDSE leverages a pretrained Residual Vector Quantization (RVQ) speech codec to vector-quantize clean speech into compact token sequences; it then reconstructs speech via autoregressive prediction of residual VQ tokens. The method integrates global feature extraction, teacher-forcing training, and cross-entropy optimization. Experiments demonstrate that UDSE consistently outperforms state-of-the-art regression-based SE methods under both single- and multi-distortion conditions. Notably, it exhibits exceptional robustness and generalization to unconventional distortions—including phase and compression artifacts—thereby enhancing the practicality and universality of SE in realistic acoustic environments.
📝 Abstract
In real-world scenarios, speech signals are inevitably corrupted by various types of interference, making speech enhancement (SE) a critical task for robust speech processing. However, most existing SE methods handle only a limited range of distortions, such as additive noise, reverberation, or band limitation, and SE under multiple simultaneous distortions remains understudied. This gap limits the generalization and practical usability of SE methods in real-world environments. To address it, this paper proposes a novel Universal Discrete-domain SE model called UDSE. Unlike regression-based SE models that directly predict the clean speech waveform or continuous features, UDSE redefines SE as a discrete-domain classification task, instead predicting the clean discrete tokens quantized by the residual vector quantizer (RVQ) of a pre-trained neural speech codec. Specifically, UDSE first extracts global features from the degraded speech. Guided by these global features, clean tokens are predicted for each VQ in turn, following the structure of RVQ: each VQ's prediction is conditioned on the outputs of the preceding VQs. Finally, the predicted clean tokens from all VQs are decoded to reconstruct the clean speech waveform. During training, UDSE employs a teacher-forcing strategy and is optimized with a cross-entropy loss. Experimental results confirm that the proposed UDSE model can effectively enhance speech degraded by various conventional and unconventional distortions, e.g., additive noise, reverberation, band limitation, clipping, phase distortion, and compression distortion, as well as their combinations. These results demonstrate the superior universality and practicality of UDSE compared to advanced regression-based SE methods.
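To make the stage-wise structure concrete, here is a minimal pure-Python sketch of residual vector quantization, the codec mechanism the abstract describes: each stage quantizes the residual left by the previous stage, which is why predicting the token for one VQ naturally depends on the results of the preceding VQs. The codebooks and vectors below are toy values for illustration, not the paper's actual codec.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared L2 distance)."""
    dists = [sum((c - v) ** 2 for c, v in zip(entry, vec)) for entry in codebook]
    return min(range(len(codebook)), key=dists.__getitem__)

def rvq_encode(codebooks, vec):
    """Quantize vec through successive residual stages.

    Returns one token index per stage; each stage sees only what the
    earlier stages failed to capture.
    """
    residual = list(vec)
    tokens = []
    for cb in codebooks:
        idx = nearest(cb, residual)
        tokens.append(idx)
        # Subtract the chosen entry so the next stage refines the remainder.
        residual = [r - c for r, c in zip(residual, cb[idx])]
    return tokens

def rvq_decode(codebooks, tokens):
    """Sum the selected entry from each stage to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for cb, idx in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, cb[idx])]
    return out

# Two-stage toy codec: the second (fine) stage refines what the
# first (coarse) stage leaves behind.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],        # coarse stage
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],  # fine stage
]
vec = [1.2, 0.8]
tokens = rvq_encode(codebooks, vec)   # one token per VQ stage
approx = rvq_decode(codebooks, tokens)
```

In UDSE's classification framing, an enhancement model would predict each stage's clean token from the degraded speech (conditioned on the earlier stages' tokens, with teacher forcing at training time) rather than computing it from a clean residual as above; the decoder of the pre-trained codec then maps the full token stack back to a waveform.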