Loud-loss: A Perceptually Motivated Loss Function for Speech Enhancement Based on Equal-Loudness Contours

📅 2025-11-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Conventional mean squared error (MSE) loss over-emphasizes low-frequency components while neglecting the human auditory system’s heightened sensitivity to high frequencies, resulting in suboptimal perceptual quality in speech enhancement. To address this, we propose a perceptually weighted loss function grounded in equal-loudness contours—the first application of psychoacoustic equal-loudness curves to speech enhancement loss design. This approach enables frequency-adaptive weighting, explicitly prioritizing minimization of reconstruction errors in high-frequency bands. The proposed loss is model-agnostic and highly generalizable. When integrated with the GTCRN architecture, it achieves a substantial 0.76-point improvement in wideband perceptual evaluation of speech quality (WB-PESQ) on the VoiceBank+DEMAND corpus (from 2.17 to 2.93), accompanied by marked gains in subjective listening quality.

📝 Abstract
The mean squared error (MSE) is a ubiquitous loss function for speech enhancement, but it fails to reflect auditory perceptual quality: MSE drives models to over-emphasize high-energy low-frequency components, leading to inadequate modeling of perceptually important high-frequency information. To overcome this limitation, we propose a perceptually weighted loss function grounded in psychoacoustic principles. Specifically, it leverages equal-loudness contours to assign frequency-dependent weights to the reconstruction error, penalizing deviations in a way that aligns with human auditory sensitivity. The proposed loss is model-agnostic and flexible, demonstrating strong generality. Experiments on the VoiceBank+DEMAND dataset show that replacing MSE with our loss in a GTCRN model elevates the WB-PESQ score from 2.17 to 2.93, a significant improvement in perceptual quality.
Problem

Research questions and friction points this paper is trying to address.

MSE loss over-emphasizes low frequencies, neglecting perceptually important high-frequency components
Existing loss functions fail to align with human auditory sensitivity characteristics
Speech enhancement requires perceptually-weighted loss based on psychoacoustic principles
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages equal-loudness contours for perceptual weighting
Assigns frequency-dependent weights to reconstruction error
Model-agnostic loss function aligning with human auditory sensitivity
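The weighting idea above can be sketched in a few lines. The paper's exact equal-loudness implementation is not reproduced here; as a stand-in, this sketch uses the standard IEC 61672 A-weighting curve, which is itself derived from the 40-phon equal-loudness contour. Function names, the STFT-magnitude input convention, and the mean-normalization of the weights are illustrative assumptions, not the authors' code.

```python
import numpy as np

def a_weight_db(freqs):
    """A-weighting gain in dB (IEC 61672), a common proxy for the
    40-phon equal-loudness contour. Input: frequencies in Hz."""
    f2 = np.asarray(freqs, dtype=float) ** 2
    ra = (12194.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * np.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    # Clamp to avoid log10(0) at DC; +2.00 dB normalizes the curve
    # to 0 dB at 1 kHz.
    return 20.0 * np.log10(np.maximum(ra, 1e-12)) + 2.00

def perceptual_weighted_mse(est_mag, ref_mag, sample_rate=16000):
    """MSE over STFT magnitudes, weighted per frequency bin by a
    loudness-contour-based curve (hypothetical sketch, not the
    paper's exact loss). Last axis = frequency bins."""
    n_freq = est_mag.shape[-1]
    freqs = np.linspace(0.0, sample_rate / 2.0, n_freq)
    w = 10.0 ** (a_weight_db(freqs) / 20.0)  # dB -> linear weights
    w /= w.mean()  # keep the overall loss scale comparable to plain MSE
    return np.mean(w * (est_mag - ref_mag) ** 2)
```

Because the weighting is just a fixed per-bin multiplier on the spectral error, it can be dropped into any magnitude-domain training objective, which is what makes the loss model-agnostic.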
Zixuan Li
Assistant Professor at ICT, UCAS
Knowledge Graph, Large Language Model
Xueliang Zhang
Inner Mongolia University
Speech enhancement, Speech separation, Computational Auditory Scene Analysis
Changjiang Zhao
College of Computer Science, Inner Mongolia University, China
Shuai Gao
College of Computer Science, Inner Mongolia University, China
Lei Miao
Lenovo, China
Zhipeng Yan
Lenovo, China
Ying Sun
Lenovo, China
Chong Zhu
Lenovo, China