🤖 AI Summary
This work challenges the assumption that Classifier-based Quality Filtering (CQF) genuinely captures intrinsic data quality. While CQF improves downstream task performance, it fails to enhance language modeling capability on high-quality data, and even implicitly discards high-quality samples, revealing a "data quality illusion." Methodologically, the work analyzes the standard binary-classifier quality-scoring framework and constructs a synthetic dataset with progressively degraded quality via random token shuffling, enabling systematic analysis of training dynamics relative to ground-truth quality. The key contribution is empirical evidence that CQF's efficacy stems from biased distributional shifts rather than genuine quality awareness. Through controlled synthetic-data ablations and counterfactual experiments, the work undermines CQF's validity as a proxy for data quality and prompts a fundamental reconsideration of quality evaluation in data curation.
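The synthetic construction mentioned above, degrading documents by randomly shuffling tokens so that word order is destroyed while token identity is preserved, can be sketched as follows. The `degrade` helper and the choice of shuffle fractions are illustrative assumptions, not the paper's exact procedure:

```python
import random

def degrade(doc: str, frac: float, rng: random.Random) -> str:
    """Shuffle a random subset of token positions.

    frac=0.0 leaves the document unchanged; frac=1.0 fully permutes it,
    destroying word order but keeping the exact multiset of tokens.
    (Hypothetical helper for illustration.)
    """
    toks = doc.split()
    # Pick which positions participate in the shuffle.
    idx = [i for i in range(len(toks)) if rng.random() < frac]
    vals = [toks[i] for i in idx]
    rng.shuffle(vals)
    for i, v in zip(idx, vals):
        toks[i] = v
    return " ".join(toks)

# A ladder of progressively degraded copies of one document.
rng = random.Random(0)
doc = "the proof follows directly from the main lemma"
ladder = [degrade(doc, p, rng) for p in (0.0, 0.25, 0.5, 1.0)]
```

Because only positions are permuted, every degraded copy has the same unigram statistics as the original; degradation is purely a loss of word order, which is what makes the quality ladder controllable.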
📝 Abstract
Large-scale models are pretrained on massive web-crawled datasets containing documents of mixed quality, making data filtering essential. A popular method is Classifier-based Quality Filtering (CQF), which trains a binary classifier to distinguish between pretraining data and a small, high-quality set. It assigns each pretraining document a quality score defined as the classifier's score and retains only the top-scoring ones. We provide an in-depth analysis of CQF. We show that while CQF improves downstream task performance, it does not necessarily enhance language modeling on the high-quality dataset. We explain this paradox by showing that CQF implicitly filters the high-quality dataset as well. We further compare the behavior of models trained with CQF to those trained on synthetic data of increasing quality, obtained via random token permutations, and find starkly different trends. Our results challenge the view that CQF captures a meaningful notion of data quality.
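The CQF recipe described in the abstract (train a binary classifier to separate the pretraining pool from a small high-quality set, score each document, keep the top scorers) can be sketched with a toy unigram log-odds classifier. This naive-Bayes-style scorer is a minimal stand-in for the fastText-style classifiers typically used in practice; the corpora and function names are hypothetical:

```python
import math
from collections import Counter

def train_log_odds(hq_docs, pt_docs):
    """Fit unigram log-odds weights; positive weight means a token is
    more characteristic of the high-quality set (add-one smoothing)."""
    hq = Counter(t for d in hq_docs for t in d.split())
    pt = Counter(t for d in pt_docs for t in d.split())
    vocab = set(hq) | set(pt)
    hq_total = sum(hq.values()) + len(vocab)
    pt_total = sum(pt.values()) + len(vocab)
    return {t: math.log((hq[t] + 1) / hq_total) - math.log((pt[t] + 1) / pt_total)
            for t in vocab}

def quality_score(doc, weights):
    """Length-normalized classifier score: the 'quality' of a document."""
    toks = doc.split()
    return sum(weights.get(t, 0.0) for t in toks) / max(len(toks), 1)

def cqf_filter(pt_docs, weights, keep_frac=0.5):
    """Rank the pretraining pool by score and retain the top fraction."""
    ranked = sorted(pt_docs, key=lambda d: quality_score(d, weights), reverse=True)
    return ranked[:max(1, int(len(ranked) * keep_frac))]

# Toy corpora (hypothetical): a small high-quality set and a mixed pool.
hq_docs = ["clear concise exposition of the theorem",
           "the proof follows from the lemma"]
pt_docs = ["buy cheap pills now click here",
           "the theorem and the lemma",
           "spam spam click now"]
weights = train_log_odds(hq_docs, pt_docs)
kept = cqf_filter(pt_docs, weights, keep_frac=1/3)
```

Note that the score only measures resemblance to the high-quality reference distribution, not quality per se, which is exactly the gap the paper probes: the same scorer, applied to the high-quality set itself, would also rank and implicitly discard some of its documents.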