Universal Discrete-Domain Speech Enhancement

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Real-world speech is often degraded by multiple concurrent distortions—such as noise, reverberation, compression artifacts, and phase distortions—whereas most existing speech enhancement (SE) methods target only single distortions, limiting their generalizability. To address this, we propose Universal Discrete-domain Speech Enhancement (UDSE), the first SE framework formulated as a discrete token classification task. UDSE leverages a pretrained Residual Vector Quantization (RVQ) speech codec to vector-quantize clean speech into compact token sequences; it then reconstructs speech via autoregressive prediction of residual VQ tokens. The method integrates global feature extraction, teacher-forcing training, and cross-entropy optimization. Experiments demonstrate that UDSE consistently outperforms state-of-the-art regression-based SE methods under both single- and multi-distortion conditions. Notably, it exhibits exceptional robustness and generalization to unconventional distortions—including phase and compression artifacts—thereby enhancing the practicality and universality of SE in realistic acoustic environments.

📝 Abstract
In real-world scenarios, speech signals are inevitably corrupted by various types of interference, making speech enhancement (SE) a critical task for robust speech processing. However, most existing SE methods only handle a limited range of distortions, such as additive noise, reverberation, or band limitation, while the study of SE under multiple simultaneous distortions remains limited. This gap affects the generalization and practical usability of SE methods in real-world environments. To address this gap, this paper proposes a novel Universal Discrete-domain SE model called UDSE. Unlike regression-based SE models that directly predict the clean speech waveform or continuous features, UDSE redefines SE as a discrete-domain classification task, instead predicting the clean discrete tokens quantized by the residual vector quantizer (RVQ) of a pre-trained neural speech codec. Specifically, UDSE first extracts global features from the degraded speech. Guided by these global features, the clean token prediction for each VQ follows the rules of RVQ, where the prediction of each VQ relies on the results of the preceding ones. Finally, the predicted clean tokens from all VQs are decoded to reconstruct the clean speech waveform. During training, the UDSE model employs a teacher-forcing strategy and is optimized with cross-entropy loss. Experimental results confirm that the proposed UDSE model can effectively enhance speech degraded by various conventional and unconventional distortions, e.g., additive noise, reverberation, band limitation, clipping, phase distortion, and compression distortion, as well as their combinations. These results demonstrate the superior universality and practicality of UDSE compared to advanced regression-based SE methods.
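The abstract's key primitive is residual vector quantization: each VQ stage quantizes the residual left by the preceding stages, and decoding sums the selected codewords. A minimal NumPy sketch of this process (not the paper's codec; codebook sizes and dimensions are illustrative):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: each stage quantizes the residual of the previous ones.

    x: (T, D) feature frames; codebooks: list of (K, D) arrays, one per stage.
    Returns one token index per frame per stage, shape (num_stages, T).
    """
    residual = x.copy()
    tokens = []
    for cb in codebooks:
        # nearest codeword for the current residual
        dist = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)  # (T, K)
        idx = dist.argmin(axis=1)                                      # (T,)
        tokens.append(idx)
        residual = residual - cb[idx]  # pass what remains to the next stage
    return np.stack(tokens)

def rvq_decode(tokens, codebooks):
    """Sum the selected codewords across stages to reconstruct features."""
    return sum(cb[idx] for cb, idx in zip(codebooks, tokens))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]  # 3 stages, 8 codes
x = rng.normal(size=(5, 4))                              # 5 feature frames
toks = rvq_encode(x, codebooks)                          # (3, 5) token grid
x_hat = rvq_decode(toks, codebooks)                      # approximate x
```

This ordering is what induces the dependency the abstract mentions: stage q's token only makes sense given the codewords chosen at stages 1..q-1, which motivates predicting the stages sequentially.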
Problem

Research questions and friction points this paper is trying to address.

Addressing speech enhancement under multiple simultaneous distortions
Proposing discrete-domain classification instead of regression models
Improving generalization for real-world noise and distortion combinations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates speech enhancement as discrete token classification
Uses residual vector quantizer tokens from neural codec
Predicts clean tokens sequentially following RVQ hierarchy
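The sequential prediction scheme above can be sketched as a per-stage classification problem: stage q's classifier sees the global feature plus the tokens of earlier stages, and during training the earlier tokens are the ground-truth ones (teacher forcing), with a cross-entropy loss per stage. This is a data-flow sketch only, with random linear classifiers standing in for the paper's actual network:

```python
import numpy as np

rng = np.random.default_rng(1)
K, D, Q, T = 8, 4, 3, 5  # codebook size, feature dim, VQ stages, frames

# Illustrative stand-ins: a global feature per frame, clean tokens from the
# codec as targets, and one linear classifier per VQ stage.
global_feat = rng.normal(size=(T, D))
clean_tokens = rng.integers(0, K, size=(Q, T))
W = [rng.normal(size=(D + q * K, K)) * 0.1 for q in range(Q)]

def one_hot(idx, k):
    out = np.zeros((idx.size, k))
    out[np.arange(idx.size), idx] = 1.0
    return out

total_ce = 0.0
for q in range(Q):
    # Teacher forcing: condition stage q on the *ground-truth* tokens of
    # earlier stages, mirroring the RVQ dependency chain.
    prev = [one_hot(clean_tokens[p], K) for p in range(q)]
    inp = np.concatenate([global_feat] + prev, axis=1)  # (T, D + q*K)
    logits = inp @ W[q]                                 # (T, K)
    logp = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    total_ce += -logp[np.arange(T), clean_tokens[q]].mean()

# total_ce is the summed per-stage cross-entropy to minimize
```

At inference, the ground-truth earlier tokens would be replaced by the model's own predictions, making the per-frame stage-by-stage prediction autoregressive across VQ levels.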
Fei Liu
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, 230027, China
Yang Ai
Associate Researcher, University of Science and Technology of China
Speech Synthesis · Speech Enhancement · Speech Coding · Deep Learning
Ye-Xin Lu
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, 230027, China
Rui-Chen Zheng
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, 230027, China
Hui-Peng Du
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, 230027, China
Zhen-Hua Ling
National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, 230027, China