SpeechRefiner: Towards Perceptual Quality Refinement for Front-End Algorithms

📅 2025-06-16
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing speech front-ends (e.g., noise reduction, dereverberation, source separation) often leave residual distortions or introduce perceptually salient artifacts, degrading subjective listening quality, yet conventional objective metrics (e.g., SI-SNR) fail to correlate well with such perceptual impairments. To address this, we propose SpeechRefiner, the first conditional flow matching (CFM)-based post-processing framework explicitly designed for perceptual speech quality enhancement. Its key contributions are: (i) the first application of CFM to waveform-level speech post-processing, enabling end-to-end distortion modeling; (ii) multi-distortion joint training, yielding strong generalization across diverse front-end algorithms and noise types; and (iii) seamless integration into industrial pipelines. Experiments demonstrate significant improvements in PESQ (+1.2), STOI (+0.08), and subjective MOS (+0.8), without retraining for specific front-ends or noise conditions. Code and audio demos are publicly available.

📝 Abstract
Speech pre-processing techniques, such as denoising, de-reverberation, and separation, are commonly employed as front-ends for various downstream speech processing tasks. However, these methods can sometimes be inadequate, leaving residual noise or introducing new artifacts. Such deficiencies are typically not captured by metrics like SI-SNR but are noticeable to human listeners. To address this, we introduce SpeechRefiner, a post-processing tool that uses Conditional Flow Matching (CFM) to improve the perceptual quality of speech. In this study, we benchmark SpeechRefiner against recent task-specific refinement methods and evaluate its performance within our internal processing pipeline, which integrates multiple front-end algorithms. Experiments show that SpeechRefiner generalizes well across diverse impairment sources and significantly enhances perceptual speech quality. Audio demos can be found at https://speechrefiner.github.io/SpeechRefiner/.
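
For readers unfamiliar with the metric named in the abstract, the following is a minimal sketch of the standard SI-SNR definition in plain Python. It is not code from the paper; the function name `si_snr` and the `eps` stabilizer are illustrative assumptions. It shows that the score is a pure energy ratio between the part of the estimate aligned with the reference and everything else, which is one reason it can miss artifacts that listeners notice.

```python
import math

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between two equal-length sample sequences.

    `eps` is an assumed small constant that avoids division by zero.
    """
    if len(estimate) != len(target):
        raise ValueError("signals must have the same length")
    n = len(target)
    # Remove the mean so a DC offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target: this is the "signal" part.
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t) + eps
    scale = dot / t_energy
    s_target = [scale * b for b in t]
    # Whatever remains after the projection counts as noise.
    e_noise = [a - s for a, s in zip(e, s_target)]
    signal_energy = sum(s * s for s in s_target)
    noise_energy = sum(x * x for x in e_noise)
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score essentially unchanged; the metric says nothing about how the residual energy sounds.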
Problem

Research questions and friction points this paper is trying to address.

Improves the perceptual quality of speech after front-end processing
Addresses residual noise and artifacts from front-end algorithms
Generalizes across diverse speech impairment sources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Post-processing tool using Conditional Flow Matching
Improves perceptual quality of speech
Generalizes across diverse impairment sources
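
As a rough illustration of what Conditional Flow Matching involves, here is a minimal sketch in plain Python. This page does not describe the paper's probability path, network, or conditioning scheme, so everything below is an assumption: it uses the commonly used linear interpolation path between a noise sample `x0` and a clean sample `x1`, and all function names (`cfm_training_pair`, `cfm_loss`, `euler_sample`) and the `velocity_fn` signature are hypothetical.

```python
def cfm_training_pair(x0, x1, t):
    """Return (x_t, target_velocity) for the linear path from x0 to x1.

    x0: sample from the source distribution (e.g. Gaussian noise)
    x1: sample from the data distribution (e.g. a clean speech frame)
    t:  time in [0, 1]
    Assumption: linear interpolation path, not necessarily the paper's choice.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity

def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    n = len(target_velocity)
    return sum((p - v) ** 2
               for p, v in zip(predicted_velocity, target_velocity)) / n

def euler_sample(velocity_fn, x0, condition, steps=10):
    """Integrate dx/dt = velocity_fn(x, t, condition) from t=0 to t=1.

    `velocity_fn` stands in for a trained network; `condition` would be
    the degraded front-end output in a refinement setting (assumed).
    """
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t, condition)
        # One explicit Euler step along the predicted velocity.
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

During training, a model would be asked to predict the target velocity at `x_t` given the conditioning signal, with `cfm_loss` as the objective; at inference, `euler_sample` pushes a noise sample toward a refined output step by step.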
Authors
Sirui Li
School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
Shuai Wang
School of Intelligence Science and Technology, Nanjing University, Suzhou, China; Shenzhen Research Institute of Big Data, Shenzhen, China
Zhijun Liu
School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
Zhongjie Jiang
Tencent Ethereal Audio Lab, Tencent, Shenzhen, China
Yannan Wang
University of Science and Technology of China
Speech Separation, Speech Enhancement, Deep Learning, Language Recognition
Haizhou Li
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China; NUS, Singapore
Automatic Speech Recognition, Speaker Recognition, Language Recognition, Voice Conversion, Machine Translation