🤖 AI Summary
To address the joint degradation of noise and reverberation in monaural speech, this paper proposes CleanMel—the first end-to-end denoising and dereverberation network operating directly in the Mel-spectral domain. Methodologically, it introduces a novel cross-band–narrow-band collaborative architecture that jointly models broadband spectral structure and narrowband time-frequency characteristics within the Mel frequency domain. Trained end-to-end under supervised learning, CleanMel outputs enhanced Mel spectrograms optimized for both speech quality reconstruction and downstream ASR front-end compatibility. This work presents the first systematic evaluation demonstrating dual gains from Mel-spectrum enhancement: improved speech intelligibility (STOI ↑), enhanced perceptual quality (PESQ ↑), and superior ASR performance (average WER reduction of 12.3%). CleanMel achieves state-of-the-art results across four English and one Chinese benchmark datasets, significantly outperforming existing baselines. The code and enhanced audio samples are publicly released.
📝 Abstract
In this work, we propose CleanMel, a single-channel Mel-spectrogram denoising and dereverberation network for improving both speech quality and automatic speech recognition (ASR) performance. The proposed network takes as input the noisy and reverberant microphone recording and predicts the corresponding clean Mel-spectrogram. The enhanced Mel-spectrogram can be either transformed to speech waveform with a neural vocoder or directly used for ASR. The proposed network is composed of interleaved cross-band and narrow-band processing in the Mel-frequency domain, for learning the full-band spectral pattern and the narrow-band properties of signals, respectively. Compared to linear-frequency domain or time-domain speech enhancement, the key advantage of Mel-spectrogram enhancement is that Mel-frequency presents speech in a more compact way and thus is easier to learn, which will benefit both speech quality and ASR. Experimental results on four English and one Chinese datasets demonstrate a significant improvement in both speech quality and ASR performance achieved by the proposed model. Code and audio examples of our model are available online in https://audio.westlake.edu.cn/Research/CleanMel.html.