Less is More: Data Curation Matters in Scaling Speech Enhancement

📅 2025-06-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
In speech enhancement, scaling up training data yields diminishing performance gains, primarily due to poor label fidelity—especially the widespread mislabeling of noisy utterances as “clean”—in large-scale datasets. This challenges the conventional assumption that “more data is always better.” Method: We systematically investigate the trade-off between data quality and quantity, revealing that label purity (i.e., ground-truth cleanliness) exerts a stronger influence on model performance than dataset size. To address this, we propose a data curation strategy grounded in SNR estimation and multi-dimensional quality screening to construct a high-fidelity subset. Contribution/Results: Experiments demonstrate that a model trained on merely 700 hours of curated data surpasses a baseline trained on 2,500 hours of raw, uncurated data. This work provides the first empirical validation in speech enhancement that high-quality small-scale data outperforms low-quality large-scale data, establishing data curation—not mere data scaling—as a critical optimization lever and introducing a new paradigm for efficient data engineering.

Technology Category

Application Category

📝 Abstract
The vast majority of modern speech enhancement systems rely on data-driven neural network models. Conventionally, larger datasets are presumed to yield superior model performance, an observation empirically validated across numerous tasks in other domains. However, recent studies reveal diminishing returns when scaling speech enhancement data. We focus on a critical factor: prevalent quality issues in ``clean'' training labels within large-scale datasets. This work re-examines this phenomenon and demonstrates that, within large-scale training sets, prioritizing high-quality training data is more important than merely expanding the data volume. Experimental findings suggest that models trained on a carefully curated subset of 700 hours can outperform models trained on the 2,500-hour full dataset. This outcome highlights the crucial role of data curation in scaling speech enhancement systems effectively.
Problem

Research questions and friction points this paper is trying to address.

Addresses quality issues in clean speech training data
Challenges the assumption that larger datasets always improve performance
Demonstrates curated small datasets outperform larger uncurated ones
Innovation

Methods, ideas, or system contributions that make the work stand out.

Prioritize high-quality training data over volume
Carefully curated 700-hour subset outperforms 2,500-hour dataset
Data curation crucial for effective speech enhancement scaling
🔎 Similar Papers
No similar papers found.