A Bitter Lesson for Data Filtering

📅 2026-05-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
This study questions the common assumption that pretraining data must be rigorously quality-filtered, focusing on the high-compute, data-scarce regime. Using new scaling studies of large-model pretraining, the authors examine how data filtering affects the outcome. Their experiments suggest that sufficiently trained large-parameter models not only tolerate low-quality and distractor data but actually benefit from nominally "poor" data, so that with enough compute the best data filter is no data filter. The results imply that strict quality filtering may be unnecessary when training scale is adequate, which offers a different perspective on data curation for large language models.
📝 Abstract
We investigate data filtering for large model pretraining via new scaling studies that target the high compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally "poor" data.
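
One way to picture the high-compute, data-scarce regime is to count how many passes over the data a fixed training budget implies once a filter discards part of the pool. The sketch below is only an illustration under stated assumptions: the link between filtering and data repetition is our reading of "data-scarce", not a mechanism stated in the abstract, and the function name `epochs_over_data` and all token counts are hypothetical rather than taken from the paper.

```python
def epochs_over_data(train_tokens: float, pool_tokens: float, keep_fraction: float) -> float:
    """Passes over the filtered pool needed to consume the training budget.

    train_tokens:  tokens the compute budget can process (hypothetical)
    pool_tokens:   unique tokens available before filtering (hypothetical)
    keep_fraction: share of the pool the filter keeps, in (0, 1]
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    return train_tokens / (pool_tokens * keep_fraction)


# Illustrative numbers only; none of these come from the paper.
TRAIN_TOKENS = 200e9
POOL_TOKENS = 100e9

for keep in (1.0, 0.5, 0.1):
    epochs = epochs_over_data(TRAIN_TOKENS, POOL_TOKENS, keep)
    print(f"filter keeps {keep:.0%} of the pool -> about {epochs:.1f} epochs")
```

Under these assumed numbers, a more aggressive filter leaves fewer unique tokens, so the same budget is spent on more repetitions of a smaller set. Whether that trade-off explains the reported benefit of unfiltered data is a question for the paper itself.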
Problem

Research questions and friction points this paper is trying to address.

data filtering
large model pretraining
high-quality data
low-quality data
scaling studies
Innovation

Methods, ideas, or system contributions that make the work stand out.

data filtering
large language models
pretraining
scaling laws
low-quality data
🔎 Similar Papers
2024-06-27 | Journal of Mathematical & Computer Applications | Citations: 2