🤖 AI Summary
In speech enhancement, scaling up training data yields diminishing performance gains, primarily due to poor label fidelity—especially the widespread mislabeling of noisy utterances as “clean”—in large-scale datasets. This challenges the conventional assumption that “more data is always better.”
Method: We systematically investigate the trade-off between data quality and quantity, revealing that label purity (i.e., ground-truth cleanliness) exerts a stronger influence on model performance than dataset size. To address this, we propose a data curation strategy grounded in SNR estimation and multi-dimensional quality screening to construct a high-fidelity subset.
Contribution/Results: Experiments demonstrate that a model trained on merely 700 hours of curated data surpasses a baseline trained on 2,500 hours of raw, uncurated data. This work provides the first empirical validation in speech enhancement that high-quality small-scale data outperforms low-quality large-scale data, establishing data curation—not mere data scaling—as a critical optimization lever and introducing a new paradigm for efficient data engineering.
📝 Abstract
The vast majority of modern speech enhancement systems rely on data-driven neural network models. Conventionally, larger datasets are presumed to yield superior model performance, an observation empirically validated across numerous tasks in other domains. However, recent studies reveal diminishing returns when scaling speech enhancement data. We focus on a critical factor: prevalent quality issues in “clean” training labels within large-scale datasets. This work re-examines this phenomenon and demonstrates that, within large-scale training sets, prioritizing high-quality training data is more important than merely expanding the data volume. Experimental findings suggest that models trained on a carefully curated subset of 700 hours can outperform models trained on the 2,500-hour full dataset. This outcome highlights the crucial role of data curation in scaling speech enhancement systems effectively.