AI Summary
Current quality filtering for pretraining corpora relies on classifiers that score educational value, yet these classifiers are highly sensitive to superficial document formatting. This work systematically demonstrates that mainstream filtering models, such as the FineWeb-Edu Classifier-based Quality Filtering (CQF) model, are vulnerable to a simple Wikipedia-style reformatting operation: after reformatting, the filter reverses its decision for approximately 7% of evaluated documents, so low-quality content that would otherwise have been excluded passes the threshold. Through adversarial reformatting and quantitative analysis, the study challenges the prevailing assumption that a single classifier can reliably curate pretraining data and reveals a critical vulnerability in existing data-cleaning pipelines.
Abstract
Classifier-based Quality Filtering (CQF) has recently emerged as a fundamental technique in constructing pre-training corpora. The ability to deploy a single model that can replace or supplement a set of heuristics has proven effective across numerous Large Language Models. In this work, we expose a critical vulnerability in this approach by demonstrating how a straightforward Wikipedia-style reformatting operation can substantially alter a model's quality assessment and enable low-quality content to surpass filtering thresholds. Our analysis reveals that the FineWeb-Edu CQF model would reverse its filtering decision for approximately 7% of evaluated documents, thereby admitting content into the pre-training corpus that would otherwise have been excluded.
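
The decision-reversal rate described above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `flip_rate`, the threshold value, and the scores are hypothetical and chosen only for illustration. The sketch assumes each document has a classifier score before and after reformatting, that a document is kept when its score meets the threshold, and that a reversal means a document rejected before reformatting is accepted afterwards.

```python
from typing import Sequence


def flip_rate(
    original_scores: Sequence[float],
    reformatted_scores: Sequence[float],
    threshold: float,
) -> float:
    """Fraction of evaluated documents rejected before reformatting
    (score < threshold) but accepted after it (score >= threshold)."""
    if len(original_scores) != len(reformatted_scores):
        raise ValueError("score lists must have the same length")
    if not original_scores:
        return 0.0
    flipped = sum(
        1
        for before, after in zip(original_scores, reformatted_scores)
        if before < threshold <= after
    )
    return flipped / len(original_scores)


# Synthetic scores for four documents (illustrative only, not real data).
before = [1.2, 2.8, 3.5, 0.9]
after = [1.5, 3.1, 3.6, 1.0]

# Hypothetical threshold of 3.0: only the second document crosses it.
print(flip_rate(before, after, threshold=3.0))
```

With these synthetic inputs, one of four documents moves from below to above the threshold, so the function returns 0.25. The denominator is all evaluated documents, which matches how the abstract phrases the roughly 7% figure.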