A Bitter Lesson for Data Filtering

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study challenges the prevailing assumption that pretraining data must undergo rigorous quality filtering, focusing on the high-compute, data-scarce regime. Using new scaling studies of large-model pretraining, the authors evaluate how data filtering affects the resulting models. Their experiments suggest that, given enough compute, sufficiently trained large-parameter models not only tolerate low-quality and distractor data but actually benefit from it, so that the best data filter is no filter at all. These results suggest that stringent quality filtering may be unnecessary when compute is plentiful relative to data, offering a new perspective on data curation for large models.

📝 Abstract

We investigate data filtering for large model pretraining via new scaling studies that target the high compute, data-scarce regime. In spite of an apparently common belief that filtering data to include only high-quality information is essential, our experiments suggest that with enough compute, the best data filter is no data filter. We find that sufficiently trained large parameter models not only tolerate low-quality and distractor data, but in fact benefit from nominally "poor" data.

Problem

Research questions and friction points this paper is trying to address.

data filtering

large model pretraining

high-quality data

low-quality data

scaling studies

Innovation

Methods, ideas, or system contributions that make the work stand out.

data filtering

large language models

pretraining