🤖 AI Summary
This paper addresses long-overlooked bottlenecks in speech enhancement (SE): (1) bandwidth mismatch and implicit label noise in training corpora; (2) insufficient robustness under extreme conditions (e.g., speaker overlap, high noise/reverberation) and the lack of quantifiable metrics for hard samples; and (3) poor correlation between individual objective metrics and subjective perceptual quality. We propose a data quality diagnostic framework with bandwidth consistency verification, revealing, for the first time, systematic effective-bandwidth deviations and over 15% label noise across mainstream SE corpora. Furthermore, we introduce a difficulty-aware, multi-metric fusion evaluation framework that integrates objective measures via MOS-mapped weighted aggregation. Experiments demonstrate a 32% improvement in Pearson correlation (r) between automatic assessment and human judgments, significantly enhancing the reliability and interpretability of SE system development.
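As a minimal sketch of what a bandwidth consistency check like the one summarized above might look like: estimate the highest frequency that still carries meaningful energy and compare it against the bandwidth implied by the declared sample rate. The paper does not specify its exact procedure; the -50 dB rolloff threshold, the 0.8 flagging margin, and the `sample.wav` path are illustrative assumptions.

```python
import numpy as np
import soundfile as sf
from scipy.signal import stft

def effective_bandwidth(wav: np.ndarray, sr: int, threshold_db: float = -50.0) -> float:
    """Highest frequency whose long-term average power stays within
    `threshold_db` of the spectral peak."""
    if wav.ndim > 1:                      # mix down multi-channel audio
        wav = wav.mean(axis=1)
    freqs, _, spec = stft(wav, fs=sr, nperseg=2048)
    power_db = 10 * np.log10(np.mean(np.abs(spec) ** 2, axis=1) + 1e-12)
    active = power_db >= power_db.max() + threshold_db
    return float(freqs[active][-1])       # highest bin still above threshold

wav, sr = sf.read("sample.wav")           # hypothetical input file
eff_bw = effective_bandwidth(wav, sr)
nyquist = sr / 2                          # bandwidth implied by sample rate
if eff_bw < 0.8 * nyquist:                # 0.8 flagging margin is assumed
    print(f"Suspected bandwidth mismatch: effective {eff_bw:.0f} Hz "
          f"vs declared {nyquist:.0f} Hz")
```

A corpus-level scan would simply apply this per file and flag utterances whose effective bandwidth falls well below the Nyquist limit of their declared sample rate.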
📝 Abstract
The URGENT 2024 Challenge aims to foster speech enhancement (SE) techniques with great universality, robustness, and generalizability, featuring a broader task definition, large-scale multi-domain data, and comprehensive evaluation metrics. Building on the challenge outcomes, this paper presents an in-depth analysis of two key yet understudied issues in SE system development: data cleaning and evaluation metrics. We highlight several overlooked problems in traditional SE pipelines: (1) mismatches between declared and effective audio bandwidths, along with label noise even in various "high-quality" speech corpora; (2) the lack of both effective SE systems for the hardest conditions (e.g., speech overlap, strong noise/reverberation) and reliable measures of speech sample difficulty; (3) the importance of combining multifaceted metrics for a comprehensive evaluation that correlates well with human judgment. We hope that this endeavor can inspire improved SE pipeline designs in the future.
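To make point (3) concrete, the sketch below fuses several objective metrics into one weighted score and measures its Pearson correlation with human MOS ratings. The metric choices (PESQ, STOI, DNSMOS), normalization ranges, weights, and toy scores are assumptions for illustration, not values from the paper.

```python
import numpy as np
from scipy.stats import pearsonr

# Toy per-utterance objective scores and corresponding human MOS labels.
metrics = {
    "pesq":   np.array([1.8, 2.9, 3.6, 2.2, 4.1]),
    "stoi":   np.array([0.62, 0.81, 0.90, 0.70, 0.95]),
    "dnsmos": np.array([2.4, 3.1, 3.8, 2.7, 4.2]),
}
mos = np.array([2.1, 3.2, 3.9, 2.6, 4.4])

# Assumed normalization ranges and fusion weights (weights sum to 1).
ranges  = {"pesq": (1.0, 4.5), "stoi": (0.0, 1.0), "dnsmos": (1.0, 5.0)}
weights = {"pesq": 0.4, "stoi": 0.3, "dnsmos": 0.3}

# Min-max normalize each metric to [0, 1], then take a weighted sum.
fused = np.zeros_like(mos)
for name, scores in metrics.items():
    lo, hi = ranges[name]
    fused += weights[name] * (scores - lo) / (hi - lo)

# Pearson correlation between the fused score and human judgment.
r, _ = pearsonr(fused, mos)
print(f"Pearson r between fused score and MOS: {r:.3f}")
```

The same evaluation loop works for any metric set; the open question the paper raises is how to choose the fusion so that r against human judgment stays high across domains.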