🤖 AI Summary
Conventional mean squared error (MSE) loss over-emphasizes high-energy low-frequency components and neglects high-frequency bands to which human hearing is perceptually sensitive, resulting in suboptimal perceptual quality in speech enhancement. To address this, we propose a perceptually weighted loss function grounded in equal-loudness contours, the first application of psychoacoustic equal-loudness curves to speech enhancement loss design. This approach enables frequency-adaptive weighting that explicitly prioritizes minimizing reconstruction errors in high-frequency bands. The proposed loss is model-agnostic and highly generalizable. When integrated with the GTCRN architecture, it achieves a substantial 0.76-point improvement in wideband perceptual evaluation of speech quality (WB-PESQ) on the VoiceBank+DEMAND corpus (from 2.17 to 2.93), accompanied by marked gains in subjective listening quality.
📝 Abstract
Mean squared error (MSE) is a ubiquitous loss function for speech enhancement, but it correlates poorly with auditory perceptual quality: MSE drives models to over-emphasize low-frequency components, which carry high energy, leaving perceptually important high-frequency information inadequately modeled. To overcome this limitation, we propose a perceptually weighted loss function grounded in psychoacoustic principles. Specifically, it leverages equal-loudness contours to assign frequency-dependent weights to the reconstruction error, thereby penalizing deviations in proportion to human auditory sensitivity. The proposed loss is model-agnostic and flexible, demonstrating strong generality. Experiments on the VoiceBank+DEMAND dataset show that replacing MSE with our loss in a GTCRN model raises the WB-PESQ score from 2.17 to 2.93, a significant improvement in perceptual quality.
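The abstract does not spell out the weighting formula, so the following is only an illustrative sketch of the idea: scale each frequency bin's squared spectral error by a gain derived from an equal-loudness curve. Here the standard A-weighting curve (which approximates the 40-phon equal-loudness contour) stands in for the paper's actual contour, and the function names, the use of STFT magnitudes, and the weight normalization are all assumptions, not the authors' implementation.

```python
import numpy as np

def a_weighting_gain(freqs_hz):
    """Linear A-weighting gain, a standard approximation of the 40-phon
    equal-loudness contour. Small near DC (~0 at 0 Hz), roughly 0.79 at
    1 kHz, peaking in the 2-4 kHz region where hearing is most sensitive."""
    f2 = np.asarray(freqs_hz, dtype=float) ** 2
    num = (12194.0 ** 2) * f2 ** 2
    den = ((f2 + 20.6 ** 2)
           * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
           * (f2 + 12194.0 ** 2))
    return num / np.maximum(den, 1e-30)  # guard against division by zero

def weighted_spectral_mse(est_mag, ref_mag, freqs_hz):
    """Perceptually weighted MSE between two spectrogram magnitude
    arrays of shape (freq_bins, frames). Each bin's squared error is
    scaled by its (normalized) equal-loudness gain, so errors in
    perceptually sensitive bands are penalized more heavily."""
    w = a_weighting_gain(freqs_hz)
    w = w / w.sum()                       # weights sum to 1 across bins
    err = (np.asarray(est_mag) - np.asarray(ref_mag)) ** 2
    return float(np.sum(w[:, None] * err) / err.shape[1])
```

With this weighting, a fixed-size magnitude error at 3 kHz contributes far more to the loss than the same error at 100 Hz, which is the opposite of plain MSE's energy-driven bias toward low frequencies.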