AI Summary
Existing speech front-ends (e.g., noise reduction, dereverberation, source separation) often leave residual distortions or introduce perceptually salient artifacts, degrading subjective listening quality, yet conventional objective metrics (e.g., SI-SNR) fail to correlate well with such perceptual impairments. To address this, we propose SpeechRefiner, the first conditional flow matching (CFM)-based post-processing framework explicitly designed for perceptual speech quality enhancement. Its key contributions are: (i) the first application of CFM to waveform-level speech post-processing, enabling end-to-end distortion modeling; (ii) multi-distortion joint training, yielding strong generalization across diverse front-end algorithms and noise types; and (iii) seamless integration into industrial pipelines. Experiments demonstrate significant improvements in PESQ (+1.2), STOI (+0.08), and subjective MOS (+0.8), without retraining for specific front-ends or noise conditions. Code and audio demos are publicly available.
Abstract
Speech pre-processing techniques such as denoising, de-reverberation, and separation are commonly employed as front-ends for various downstream speech processing tasks. However, these methods can sometimes be inadequate, leaving residual noise or introducing new artifacts. Such deficiencies are typically not captured by metrics like SI-SNR but are noticeable to human listeners. To address this, we introduce SpeechRefiner, a post-processing tool that utilizes Conditional Flow Matching (CFM) to improve the perceptual quality of speech. In this study, we benchmark SpeechRefiner against recent task-specific refinement methods and evaluate its performance within our internal processing pipeline, which integrates multiple front-end algorithms. Experiments show that SpeechRefiner generalizes strongly across diverse impairment sources, significantly enhancing speech perceptual quality. Audio demos can be found at https://speechrefiner.github.io/SpeechRefiner/.
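To make the CFM idea concrete, the sketch below builds one training example for a conditional flow matching objective under common assumptions (a straight-line probability path from a Gaussian prior to the clean waveform, with the front-end output as the conditioning signal). This is an illustrative toy, not the authors' implementation; `cfm_training_pair` and all variable names are hypothetical, and the real model would regress `v_target` with a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

def cfm_training_pair(clean, degraded, rng):
    """Build one conditional flow matching training example.

    Hypothetical sketch assuming a straight-line (rectified-flow style)
    path between a Gaussian prior sample and the clean waveform.
    """
    x0 = rng.standard_normal(clean.shape)  # sample from the Gaussian prior
    t = rng.uniform()                      # random time in [0, 1]
    xt = (1.0 - t) * x0 + t * clean        # point on the interpolation path
    v_target = clean - x0                  # constant velocity along the path
    cond = degraded                        # front-end output as the condition
    return t, xt, cond, v_target

# Toy "clean" waveform and a lightly distorted front-end output
clean = rng.standard_normal(16)
degraded = clean + 0.1 * rng.standard_normal(16)

t, xt, cond, v = cfm_training_pair(clean, degraded, rng)

# Sanity check: on a straight path, x0 + t * v_target recovers xt
x0 = xt - t * v
assert np.allclose(x0 + t * v, xt)
```

A network trained to predict `v` from `(xt, t, cond)` can then be integrated with an ODE solver from the prior at `t = 0` to a refined waveform at `t = 1`.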