🤖 AI Summary
To address the low quality of the massive, noisy text corpora used in large language model (LLM) pretraining, this paper introduces an LLM-based, line-level data filtering method. GPT-4o mini labels a 20,000-document sample from FineWeb line by line, generating descriptive labels for low-quality lines, which are then grouped into nine main categories. A DeBERTa-v3 classifier trained on these labels scales the filtering to a 10B-token subset of FineWeb. GPT-2 models trained on the filtered data achieve higher accuracy on HellaSwag than models trained on the original data and reach their performance targets faster, even with up to 25% less data. To support reproducibility and further research, the authors release the quality-annotated dataset, FinerWeb-10BT, and the codebase.
📝 Abstract
Data quality is crucial for training Large Language Models (LLMs). Traditional heuristic filters often miss low-quality text or mistakenly remove valuable content. In this paper, we introduce an LLM-based line-level filtering method to enhance training data quality. We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels for low-quality lines. These labels are grouped into nine main categories, and we train a DeBERTa-v3 classifier to scale the filtering to a 10B-token subset of FineWeb. To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets. The results show that models trained on the filtered data achieve higher accuracy on the HellaSwag benchmark and reach their performance targets faster, even with up to 25% less data. This demonstrates that LLM-based line-level filtering can significantly improve data quality and training efficiency for LLMs. We release our quality-annotated dataset, FinerWeb-10BT, and the codebase to support further work in this area.
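As a rough illustration of what line-level filtering means in practice, the sketch below scores each line of a document separately and keeps only the lines that pass a threshold. It is a minimal sketch under stated assumptions, not the released implementation: `filter_document`, `toy_score`, the keyword list, and the 0.5 threshold are all hypothetical, and `toy_score` is a keyword stand-in for the trained DeBERTa-v3 classifier described in the abstract.

```python
from typing import Callable


def filter_document(
    text: str,
    score_line: Callable[[str], float],
    threshold: float = 0.5,  # assumed cutoff, not taken from the paper
) -> str:
    """Return the document with low-scoring lines removed.

    Each non-empty line is scored on its own; a line is kept only if its
    score is at least `threshold`. Empty lines are dropped.
    """
    kept = [
        line
        for line in text.splitlines()
        if line.strip() and score_line(line) >= threshold
    ]
    return "\n".join(kept)


def toy_score(line: str) -> float:
    """Hypothetical stand-in for a classifier's 'clean' probability."""
    lowered = line.lower()
    noisy_markers = ("click here", "subscribe", "all rights reserved")
    return 0.0 if any(marker in lowered for marker in noisy_markers) else 1.0


doc = (
    "Photosynthesis converts light into chemical energy.\n"
    "Click here to subscribe!\n"
    "Plants store that energy as sugars."
)
cleaned = filter_document(doc, toy_score)
print(cleaned)
```

In a real pipeline, `score_line` would wrap the classifier's prediction for a single line, so the same function could be applied to every document in the corpus while the rest of each document is left intact.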