Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data

📅 2025-05-08
📈 Citations: 0 · Influential: 0
🤖 AI Summary
To address two key bottlenecks in LLM training-data filtering, namely lagging quality assessment and subjectivity in seed sample selection, this paper proposes an efficient data filtering and verification pipeline. Methodologically, it (1) introduces a lightweight, near-real-time verification strategy that decouples data-quality feedback from full-scale model training, and (2) proposes a quality-hypothesis-driven seed-sample self-optimization mechanism to iteratively improve classifier generalization. The pipeline integrates fastText-based lightweight classification, model-guided filtering, bilingual (English–Chinese) consistency verification, and augmentation of FineWeb and Chinese FineWeb. It yields Ultra-FineWeb, a high-quality corpus of approximately 1 trillion English tokens and 120 billion Chinese tokens. Experiments demonstrate substantial improvements across multiple downstream LLM benchmarks, an over 40% reduction in training cost, and strong robustness with minimal computational overhead.

📝 Abstract
Data quality has become a key factor in enhancing model performance with the rapid development of large language models (LLMs). Model-driven data filtering has increasingly become a primary approach for acquiring high-quality data. However, it still faces two main challenges: (1) the lack of an efficient data verification strategy makes it difficult to provide timely feedback on data quality; and (2) the selection of seed data for training classifiers lacks clear criteria and relies heavily on human expertise, introducing a degree of subjectivity. To address the first challenge, we introduce an efficient verification strategy that enables rapid evaluation of the impact of data on LLM training with minimal computational cost. To tackle the second challenge, we build upon the assumption that high-quality seed data is beneficial for LLM training, and by integrating the proposed verification strategy, we optimize the selection of positive and negative samples and propose an efficient data filtering pipeline. This pipeline not only improves filtering efficiency, classifier quality, and robustness, but also significantly reduces experimental and inference costs. In addition, to efficiently filter high-quality data, we employ a lightweight classifier based on fastText, and successfully apply the filtering pipeline to two widely-used pre-training corpora, FineWeb and Chinese FineWeb datasets, resulting in the creation of the higher-quality Ultra-FineWeb dataset. Ultra-FineWeb contains approximately 1 trillion English tokens and 120 billion Chinese tokens. Empirical results demonstrate that the LLMs trained on Ultra-FineWeb exhibit significant performance improvements across multiple benchmark tasks, validating the effectiveness of our pipeline in enhancing both data quality and training efficiency.
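The core filtering step the abstract describes (score every document with a lightweight classifier trained on positive and negative seed samples, then keep documents above a quality threshold) can be sketched in pure Python. This is an illustrative stand-in, not the paper's implementation: the actual pipeline uses a fastText classifier trained on carefully selected seeds, whereas the log-odds scorer, the tiny seed lists, and the zero threshold below are all hypothetical.

```python
from collections import Counter
import math

def train_scorer(pos_docs, neg_docs):
    """Train a tiny bag-of-words log-odds scorer from seed samples.

    Stand-in for the paper's fastText classifier: each word gets a
    log-odds weight from its frequency in positive vs. negative seeds,
    with add-one smoothing over the joint vocabulary.
    """
    pos = Counter(w for d in pos_docs for w in d.lower().split())
    neg = Counter(w for d in neg_docs for w in d.lower().split())
    vocab = set(pos) | set(neg)
    p_total = sum(pos.values()) + len(vocab)
    n_total = sum(neg.values()) + len(vocab)
    return {w: math.log((pos[w] + 1) / p_total) - math.log((neg[w] + 1) / n_total)
            for w in vocab}

def score_document(weights, doc):
    """Average per-word log-odds; positive values lean 'high quality'."""
    words = doc.lower().split()
    if not words:
        return 0.0
    return sum(weights.get(w, 0.0) for w in words) / len(words)

def filter_corpus(weights, corpus, threshold=0.0):
    """Keep only documents scoring above the quality threshold."""
    return [d for d in corpus if score_document(weights, d) > threshold]

# Hypothetical seed samples; the real seeds are curated and iteratively refined.
pos_seeds = ["the theorem follows from the lemma by induction",
             "we analyze the convergence of the training procedure"]
neg_seeds = ["click here buy now free free free",
             "subscribe subscribe win win prizes now"]

weights = train_scorer(pos_seeds, neg_seeds)
corpus = ["we prove convergence of the procedure by induction",
          "free prizes click now subscribe"]
kept = filter_corpus(weights, corpus)
```

The key property this mirrors is cheapness: like fastText, a linear bag-of-words model can score web-scale corpora at negligible cost compared with model-based (LLM) scoring.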
Problem

Research questions and friction points this paper is trying to address.

Lack of an efficient data verification strategy for timely feedback on LLM training data quality
No clear criteria for selecting seed data to train quality classifiers
Need for a lightweight classifier to filter high-quality training data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Efficient verification strategy for LLM data quality
Optimized seed data selection using verification feedback
Lightweight fastText classifier for high-quality data filtering
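The second innovation, seed selection guided by verification feedback, can be sketched as a greedy loop: tentatively promote a candidate document into the positive or negative seed pool, re-check a cheap verification metric, and keep only promotions that improve it. The paper's real verification trains small proxy models; the word-overlap scorer, the `make_verifier` accuracy stub, and all texts below are hypothetical stand-ins.

```python
def score(seed_pos, seed_neg, doc):
    """Toy quality score: positive-seed word overlap minus negative-seed
    overlap, normalized by document length (stand-in for a classifier)."""
    words = set(doc.lower().split())
    if not words:
        return 0.0
    pos_vocab = {w for d in seed_pos for w in d.lower().split()}
    neg_vocab = {w for d in seed_neg for w in d.lower().split()}
    return (len(words & pos_vocab) - len(words & neg_vocab)) / len(words)

def make_verifier(val_pos, val_neg):
    """Cheap stand-in for the paper's verification strategy (which trains a
    small proxy model): accuracy on a tiny labeled validation set."""
    def verify(seed_pos, seed_neg):
        correct = sum(1 for d in val_pos if score(seed_pos, seed_neg, d) > 0)
        correct += sum(1 for d in val_neg if score(seed_pos, seed_neg, d) <= 0)
        return correct / (len(val_pos) + len(val_neg))
    return verify

def refine_seeds(seed_pos, seed_neg, pool, verify, rounds=3):
    """Per round: try promoting each pool document into the seed set whose
    sign its current score suggests, and keep only the single promotion
    that most improves the verification metric."""
    for _ in range(rounds):
        best_acc, best_trial, promoted = verify(seed_pos, seed_neg), None, None
        for doc in pool:
            if score(seed_pos, seed_neg, doc) >= 0:
                trial = (seed_pos + [doc], seed_neg)
            else:
                trial = (seed_pos, seed_neg + [doc])
            acc = verify(*trial)
            if acc > best_acc:
                best_acc, best_trial, promoted = acc, trial, doc
        if best_trial is None:  # no candidate improves verification: stop
            break
        seed_pos, seed_neg = best_trial
        pool = [d for d in pool if d is not promoted]
    return seed_pos, seed_neg

# Hypothetical seeds, candidate pool, and validation texts.
seed_pos = ["clear explanation of gradient descent"]
seed_neg = ["buy cheap pills online now"]
pool = ["detailed explanation of the descent method",
        "cheap cheap pills pills buy",
        "a rigorous proof of the gradient bound"]
verify = make_verifier(
    val_pos=["rigorous derivation concerning convergence rates"],
    val_neg=["buy pills cheap online"])
new_pos, new_neg = refine_seeds(seed_pos, seed_neg, pool, verify)
```

Because every promotion is gated on the verification metric rather than on human judgment, the loop operationalizes the paper's quality hypothesis: a seed is "good" exactly when adding it measurably helps downstream evaluation.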
👥 Authors
Yudong Wang (ModelBest Inc.)
Zixuan Fu (Nanyang Technological University)
Jie Cai (ModelBest Inc.)
Peijun Tang (ModelBest Inc.)
Hongya Lyu (ModelBest Inc.)
Yewei Fang (Soochow University)
Zhi Zheng (ModelBest Inc.)
Jie Zhou (ModelBest Inc.)
Guoyang Zeng (ModelBest Inc.)
Chaojun Xiao (Postdoctoral Researcher, Tsinghua University)
Xu Han (Tsinghua University)
Zhiyuan Liu (Tsinghua University)