GneissWeb: Preparing High Quality Data for LLMs at Scale

📅 2025-02-19
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
To address the scarcity of high-quality, large-scale open data for large language model (LLM) pretraining, this work introduces GneissWeb, an open dataset of roughly 10 trillion tokens. The recipe combines *shard-level exact substring deduplication* with *multi-stage quality filtering*, integrating ensemble-based quality scoring, heuristic rule-based filtering, and benchmark-driven evaluation. This approach preserves massive scale while substantially improving data quality and downstream generalization. Empirical evaluation shows that models pretrained on GneissWeb outperform those pretrained on FineWeb-V1.1.0 by an average of 2.73 percentage points across 11 diverse benchmarks, and the gain remains at 1.75 points when the evaluation is extended to 20 benchmarks.
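The deduplication step can be made concrete with a small sketch. Production pipelines typically implement exact substring deduplication with suffix-array machinery; the version below is a simplified stand-in that detects repeated fixed-length token windows with an in-memory set. The 50-token window length, the keep-first-occurrence policy, and the `dedup_shard` name are all illustrative assumptions, not the paper's parameters.

```python
# Minimal sketch of shard-level exact substring deduplication.
# Assumptions (not from the paper): a 50-token duplicate window,
# a keep-the-first-occurrence policy, and an in-memory set per shard;
# production recipes typically use suffix arrays instead.
WINDOW = 50  # assumed minimum length of a duplicate span, in tokens

def dedup_shard(docs: list[list[str]]) -> list[list[str]]:
    """Drop token spans whose exact WINDOW-token windows already occurred
    earlier in the same shard; other shards are untouched, which is what
    makes the pass cheap to run in parallel."""
    seen: set[tuple[str, ...]] = set()
    cleaned = []
    for tokens in docs:
        keep = [True] * len(tokens)
        for i in range(len(tokens) - WINDOW + 1):
            window = tuple(tokens[i:i + WINDOW])
            if window in seen:
                # Mark every token of the repeated span for removal.
                keep[i:i + WINDOW] = [False] * WINDOW
            else:
                seen.add(window)
        cleaned.append([t for t, k in zip(tokens, keep) if k])
    return cleaned
```

Running the pass per shard means cross-shard duplicates can survive; that is the deliberate cost that keeps exact deduplication tractable at the 10-trillion-token scale.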

πŸ“ Abstract
Data quantity and quality play a vital role in determining the performance of Large Language Models (LLMs). High-quality data, in particular, can significantly boost the LLM's ability to generalize on a wide range of downstream tasks. Large pre-training datasets for leading LLMs remain inaccessible to the public, whereas many open datasets are small in size (less than 5 trillion tokens), limiting their suitability for training large models. In this paper, we introduce GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training LLMs. The GneissWeb recipe that produced the dataset consists of sharded exact sub-string deduplication and a judiciously constructed ensemble of quality filters. GneissWeb achieves a favorable trade-off between data quality and quantity, producing models that outperform models trained on state-of-the-art open large datasets (5+ trillion tokens). We show that models trained using the GneissWeb dataset outperform those trained on FineWeb-V1.1.0 by 2.73 percentage points in terms of average score computed on a set of 11 commonly used benchmarks (both zero-shot and few-shot) for pre-training dataset evaluation. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), models trained using GneissWeb still achieve a 1.75 percentage point advantage over those trained on FineWeb-V1.1.0.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM performance via high-quality data
Addressing the scarcity of large open pre-training datasets
Balancing data quality against data quantity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sharded exact sub-string deduplication
Judiciously constructed ensemble of quality filters (see the sketch after this list)
Yields a dataset of around 10 trillion tokens
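To illustrate how an ensemble of quality filters can be composed, here is a minimal sketch under stated assumptions: the two heuristics (`stopword_fraction`, `repeated_line_fraction`), their thresholds, and the AND-combination are invented for illustration. They are not the paper's actual annotators, which also include model-based quality scores tuned via benchmark-driven evaluation.

```python
# Minimal sketch of an ensemble of heuristic quality filters.
# The scorers, thresholds, and AND-combination are illustrative
# assumptions; the GneissWeb recipe combines its own (partly
# model-based) annotators, tuned with benchmark-driven ablations.
def stopword_fraction(text: str) -> float:
    """Share of words that are common English stopwords; unusually low
    values often indicate tables, code, or boilerplate rather than prose."""
    stopwords = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}
    words = text.lower().split()
    return sum(w in stopwords for w in words) / max(len(words), 1)

def repeated_line_fraction(text: str) -> float:
    """Share of non-empty lines that exactly repeat an earlier line."""
    lines = [line for line in text.splitlines() if line.strip()]
    return 1.0 - len(set(lines)) / max(len(lines), 1)

def passes_ensemble(text: str) -> bool:
    """Keep a document only if every filter in the ensemble accepts it."""
    return stopword_fraction(text) >= 0.06 and repeated_line_fraction(text) <= 0.20
```

An AND-combination like this discards a document as soon as any single filter flags it; loosening or tightening the individual thresholds is what moves a recipe along the quality-quantity trade-off the abstract describes.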
Authors

Hajar Emami Gohari
IBM Research
S. Kadhe
IBM Research
Syed Yousaf Shah
IBM Research
Constantin Adam
IBM Research
Abdulhamid A. Adebayo
IBM Research
Praneet Adusumilli
IBM Research
Farhan Ahmed
IBM Research
Nathalie Baracaldo Angel
IBM Research
Santosh Borse
IBM Research
Yuan-Chi Chang
IBM Research
Xuan-Hong Dang
IBM Thomas J. Watson Research Center, NY, USA
data mining, machine learning, artificial intelligence
Nirmit Desai
IBM Research
Ravital Eres
IBM Research
Ran Iwamoto
IBM Research
Alexei Karve
IBM Research
Y. Koyfman
IBM Research
Wei-Han Lee
IBM Research
Changchang Liu
IBM Research
Boris Lublinsky
IBM Research
Takuyo Ohko
IBM Research
Pablo Pesce
IBM Research
Maroun Touma
IBM Research
Shiqiang Wang
IBM T. J. Watson Research Center
Agentic AI, Collaborative & Federated AI, LLMs, Machine Learning, Optimization Algorithms
Shalisha Witherspoon
IBM Research
Herbert Woisetschlager
IBM Research
David Wood
IBM Research
Kun-Lung Wu
IBM Research
Issei Yoshida
IBM Research
Syed Zawad
Research Scientist, IBM
Machine Learning, Distributed Systems, Cloud Computing, Federated Learning
Petros Zerfos
IBM T.J. Watson Research Center
Data for GenAI, Machine Learning, Time Series Analysis, Quantitative Finance, Big Data
Yi Zhou
IBM Research
Bishwaranjan Bhattacharjee
IBM Research