🤖 AI Summary
This study challenges the prevailing assumption that pretraining data must undergo rigorous quality filtering, focusing on the high-compute, data-scarce regime. Using new scaling studies of large-model pretraining, the authors evaluate the impact of data filtering. Their experiments suggest that, given enough compute, sufficiently trained large-parameter models not only tolerate low-quality and distractor data but actually benefit from it, so that the best data filter is no filter at all. These results indicate that stringent quality filtering may be unnecessary when enough compute is available, offering a new perspective on data curation for large models.
📝 Abstract
We investigate data filtering for large model pretraining via new scaling studies that target the high-compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large-parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally "poor" data.