Is a Document Educational or Just Wikipedia-Style? -- Pitfalls of Classifier-Based Quality Filtering

📅 2026-05-21
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Pretraining-corpus quality filters that rely on a classifier to assess educational value are highly sensitive to superficial document formatting. This work shows that a widely used filtering model, the FineWeb-Edu Classifier-based Quality Filtering (CQF) model, is vulnerable to a simple Wikipedia-style reformatting operation: the model reverses its filtering decision for approximately 7% of evaluated documents, admitting content into the pre-training corpus that would otherwise have been excluded. The study thereby challenges the assumption that a single classifier can reliably curate pretraining data and exposes a weakness in existing data-filtering pipelines.
📝 Abstract
Classifier-based Quality Filtering has recently emerged as a fundamental technique in constructing pre-training corpora. The ability to deploy a single model that can replace or supplement a set of heuristics has proven effective across numerous Large Language Models. In this work, we expose a critical vulnerability in this approach by demonstrating how a straightforward Wikipedia-style reformatting operation can substantially alter a model's quality assessment and enable low-quality content to surpass filtering thresholds. Our analysis reveals that the FineWeb-Edu CQF model would reverse its filtering decision for approximately 7% of evaluated documents, thereby admitting content into the pre-training corpus that would otherwise have been excluded.
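The decision reversal described in the abstract can be measured as a flip rate: the fraction of documents whose keep/drop decision changes once the text is reformatted. The sketch below is only an illustration under stated assumptions. `wikipedia_style`, `toy_score`, and the threshold of 3.0 are hypothetical stand-ins; they are not the authors' reformatting operation, not the FineWeb-Edu model, and not its actual cutoff. `toy_score` is deliberately format-sensitive so that the effect is visible.

```python
from typing import Callable, Iterable


def wikipedia_style(text: str, title: str = "Article") -> str:
    """Hypothetical reformatting: add wiki-like headings, leave the wording untouched."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return text
    parts = [f"= {title} =", paragraphs[0]]
    for i, paragraph in enumerate(paragraphs[1:], start=1):
        parts += [f"== Section {i} ==", paragraph]
    return "\n\n".join(parts)


def toy_score(text: str) -> float:
    """Stand-in quality scorer (NOT a real classifier).

    The score grows with length and gets a bonus when the text opens with a
    wiki-style title line, which mimics sensitivity to surface formatting.
    """
    length_score = min(len(text.split()) / 50.0, 4.0)
    format_bonus = 1.0 if text.lstrip().startswith("= ") else 0.0
    return length_score + format_bonus


def flip_rate(
    docs: Iterable[str],
    score: Callable[[str], float],
    reformat: Callable[[str], str],
    threshold: float,
) -> float:
    """Fraction of documents whose keep/drop decision changes after reformatting."""
    docs = list(docs)
    if not docs:
        return 0.0
    flips = sum(
        (score(doc) >= threshold) != (score(reformat(doc)) >= threshold)
        for doc in docs
    )
    return flips / len(docs)
```

To evaluate a real filter, `score` would be replaced by a call to the classifier under test and `threshold` by the cutoff that pipeline actually uses. Counting flips in both directions matches the abstract's wording that the model "would reverse its filtering decision".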
Problem

Research questions and friction points this paper is trying to address.

Classifier-Based Quality Filtering
Pre-training Corpora
Wikipedia-Style Reformatting
Quality Assessment
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Classifier-Based Quality Filtering
Pre-training Corpus
Wikipedia-Style Reformatting
Quality Assessment Vulnerability
Document Filtering