Is a Document Educational or Just Wikipedia-Style? -- Pitfalls of Classifier-Based Quality Filtering

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Classifier-based quality filtering for pretraining corpora relies on a model's estimate of educational value, but that estimate is highly sensitive to superficial document formatting. This work shows that filtering models such as the FineWeb-Edu Classifier-based Quality Filtering (CQF) model are vulnerable to a simple Wikipedia-style reformatting operation: the model reverses its filtering decision for approximately 7% of evaluated documents, so content that would otherwise have been excluded passes the filtering threshold. Using reformatting perturbations and quantitative analysis, the study challenges the assumption that a single classifier can reliably curate pretraining data and exposes a weakness in existing data-cleaning pipelines.

📝 Abstract

Classifier-based Quality Filtering (CQF) has recently emerged as a fundamental technique in constructing pre-training corpora. The ability to deploy a single model that can replace or supplement a set of heuristics has proven effective across numerous Large Language Models. In this work, we expose a critical vulnerability in this approach by demonstrating how a straightforward Wikipedia-style reformatting operation can substantially alter a model's quality assessment and enable low-quality content to surpass filtering thresholds. Our analysis reveals that the FineWeb-Edu CQF model would reverse its filtering decision for approximately 7% of evaluated documents, thereby admitting content into the pre-training corpus that would otherwise have been excluded.
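
To make the measurement described in the abstract concrete, the following is a minimal sketch of how a decision-flip rate could be computed for any quality scorer. It is an illustration under stated assumptions, not the paper's procedure: `wikipedia_style_reformat` is a hypothetical stand-in for the reformatting operation (the abstract does not specify it), `toy_score` is an invented scorer that deliberately rewards heading markup, and the threshold is an arbitrary parameter.

```python
from typing import Callable, Iterable


def wikipedia_style_reformat(text: str, title: str = "Overview") -> str:
    """Hypothetical reformatting: wrap paragraphs in wiki-like headings.

    This is NOT the paper's exact operation, only an illustrative stand-in.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    lines = [f"= {title} =", ""]
    for i, paragraph in enumerate(paragraphs, start=1):
        lines.append(f"== Section {i} ==")
        lines.append(paragraph)
        lines.append("")
    lines.append("== References ==")
    return "\n".join(lines)


def flip_rate(
    docs: Iterable[str],
    score: Callable[[str], float],
    threshold: float,
) -> float:
    """Fraction of evaluated documents that are rejected in their original
    form (score < threshold) but accepted after reformatting."""
    docs = list(docs)
    if not docs:
        return 0.0
    flipped = sum(
        1
        for doc in docs
        if score(doc) < threshold <= score(wikipedia_style_reformat(doc))
    )
    return flipped / len(docs)


def toy_score(text: str) -> float:
    """Invented scorer (assumption): rewards '==' heading markup to mimic
    a classifier that is sensitive to formatting rather than content."""
    return 1.0 + text.count("==")


if __name__ == "__main__":
    sample = ["buy cheap pills now", "== A ==\ntext\n== B =="]
    print(flip_rate(sample, toy_score, threshold=3.0))
```

In practice `score` would wrap a real quality classifier and `threshold` would be the cutoff used by the filtering pipeline under test; the structure of the measurement stays the same.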

Problem

Research questions and friction points this paper is trying to address.

Classifier-Based Quality Filtering

Pre-training Corpora

Wikipedia-Style Reformatting

Quality Assessment

Large Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Classifier-Based Quality Filtering

Pre-training Corpus

Wikipedia-Style Reformatting