Improving Speech Enhancement with Multi-Metric Supervision from Learned Quality Assessment

📅 2025-06-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Conventional speech enhancement (SE) training optimizes signal-level objectives such as SI-SNR. These objectives align poorly with perceptual quality, generalize weakly across evaluation metrics, and require clean reference signals, which limits their use in real-world scenarios. This paper proposes an end-to-end SE training paradigm guided by a learned Speech Quality Assessment (SQA) model. It replaces conventional loss functions with a learned SQA model, jointly optimized on multiple metrics, as the supervisory signal; uses a multi-task quality prediction network that jointly regresses SI-SNR, STOI, PESQ, and ESTOI; and incorporates an unsupervised adaptation strategy that leverages real-world noisy data. The work is presented as the first to embed SQA in a closed loop within SE training, targeting three bottlenecks: misalignment between optimization objectives and auditory perception, insufficient generalization, and reliance on clean references. Experiments on both simulated and real-world noise report improvements in PESQ, ESTOI, and SI-SNR, better cross-dataset generalization, and training that does not depend on clean speech references.

📝 Abstract

Speech quality assessment (SQA) aims to predict the perceived quality of speech signals under a wide range of distortions. It is inherently connected to speech enhancement (SE), which seeks to improve speech quality by removing unwanted signal components. While SQA models are widely used to evaluate SE performance, their potential to guide SE training remains underexplored. In this work, we investigate a training framework that leverages an SQA model, trained to predict multiple evaluation metrics from a public SE leaderboard, as a supervisory signal for SE. This approach addresses a key limitation of conventional SE objectives, such as SI-SNR, which often fail to align with perceptual quality and generalize poorly across evaluation metrics. Moreover, it enables training on real-world data where clean references are unavailable. Experiments on both simulated and real-world test sets show that SQA-guided training consistently improves performance across a range of quality metrics.
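To make concrete why an objective such as SI-SNR needs a clean reference, here is a minimal sketch of the standard scale-invariant SNR computation in plain Python. The function name `si_snr` and the `eps` stabilizer are illustrative choices, not taken from the paper.

```python
import math

def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a clean reference."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)
    # Remove the mean so a DC offset does not affect the score.
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]
    # Project the estimate onto the reference to get the target component.
    dot = sum(a * b for a, b in zip(e, r))
    ref_energy = sum(b * b for b in r) + eps
    scale = dot / ref_energy
    target = [scale * b for b in r]
    # Whatever is left after the projection counts as noise.
    noise = [a - t for a, t in zip(e, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

The `reference` argument is the point: without a clean signal the score cannot be computed, which is the limitation the abstract raises for real-world recordings.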

Problem

Research questions and friction points this paper is trying to address.

Leveraging SQA models to guide speech enhancement training

Addressing misalignment between conventional SE objectives and perceptual quality

Enabling SE training without clean references in real-world data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses SQA model for multi-metric supervision

Aligns SE training with perceptual quality

Enables training without clean references
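The ideas listed above can be sketched as a single training step in PyTorch. This is an illustrative toy, not the paper's implementation: `ToyEnhancer`, `ToyQualityPredictor`, and `sqa_guided_step` are hypothetical names, both networks are tiny stand-ins, the quality predictor is assumed to be kept fixed during SE training, and the predicted metrics are assumed to be higher-is-better and equally weighted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyEnhancer(nn.Module):
    """Hypothetical SE model: one 1-D convolution over the waveform."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=9, padding=4)

    def forward(self, noisy):             # noisy: (batch, samples)
        return self.conv(noisy.unsqueeze(1)).squeeze(1)

class ToyQualityPredictor(nn.Module):
    """Hypothetical SQA model: one predicted score per metric."""
    def __init__(self, num_metrics=4):
        super().__init__()
        self.features = nn.Conv1d(1, 8, kernel_size=16, stride=8)
        self.head = nn.Linear(8, num_metrics)

    def forward(self, wav):               # wav: (batch, samples)
        h = torch.relu(self.features(wav.unsqueeze(1)))
        return self.head(h.mean(dim=-1))  # (batch, num_metrics)

def sqa_guided_step(enhancer, predictor, optimizer, noisy):
    """One update that raises the predicted scores; no clean reference is used."""
    optimizer.zero_grad()
    scores = predictor(enhancer(noisy))
    loss = -scores.mean()                 # higher predicted quality -> lower loss
    loss.backward()
    optimizer.step()
    return loss.item()

enhancer = ToyEnhancer()
predictor = ToyQualityPredictor()
predictor.eval()
for p in predictor.parameters():          # assumption: predictor stays fixed
    p.requires_grad_(False)

optimizer = torch.optim.Adam(enhancer.parameters(), lr=1e-3)
noisy_batch = torch.randn(2, 1600)        # random stand-in for noisy speech
loss_value = sqa_guided_step(enhancer, predictor, optimizer, noisy_batch)
```

Only noisy input enters `sqa_guided_step`, so the same loop could in principle run on recordings that have no clean counterpart. In practice the predictor would be a model pretrained on real quality labels or metric scores rather than a randomly initialized network.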
