Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

📅 2024-12-03
🏛️ arXiv.org
📈 Citations: 14
Influential: 2
🤖 AI Summary
To address the trade-off between accuracy and data quantity when pretraining large language models on Common Crawl over long token horizons (e.g., 15T tokens), this work combines classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters. The resulting 6.3T-token dataset matches DCLM on MMLU while containing four times more unique real tokens, and a high-quality subset improves MMLU by 5.6 over DCLM when training 8B-parameter models for 1T tokens. An 8B-parameter model trained for 15T tokens, 7.2T of which came from this dataset, outperforms Llama 3.1 8B: +5 on MMLU, +3.1 on ARC-Challenge, and +0.5 on average across ten diverse tasks. The result is a reproducible, extensible recipe for constructing high-quality web-scale pretraining corpora for foundation language models.

📝 Abstract
Recent English Common Crawl datasets like FineWeb-Edu and DCLM achieved significant benchmark gains via aggressive model-based filtering, but at the cost of removing 90% of data. This limits their suitability for long token horizon training, such as 15T tokens for Llama 3.1. In this paper, we show how to achieve better trade-offs between accuracy and data quantity by a combination of classifier ensembling, synthetic data rephrasing, and reduced reliance on heuristic filters. When training 8B parameter models for 1T tokens, using a high-quality subset of our data improves MMLU by 5.6 over DCLM, demonstrating the efficacy of our methods for boosting accuracies over a relatively short token horizon. Furthermore, our full 6.3T token dataset matches DCLM on MMLU, but contains four times more unique real tokens than DCLM. This unlocks state-of-the-art training over a long token horizon: an 8B parameter model trained for 15T tokens, of which 7.2T came from our dataset, is better than the Llama 3.1 8B model: +5 on MMLU, +3.1 on ARC-Challenge, and +0.5 on average across ten diverse tasks. The dataset is available at https://data.commoncrawl.org/contrib/Nemotron/Nemotron-CC/index.html
Problem

Research questions and friction points this paper is trying to address.

Balancing accuracy and data quantity in pretraining datasets
Enhancing long-horizon training with refined Common Crawl data
Improving model performance via classifier ensembling and synthetic rephrasing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Classifier ensembling for data refinement
Synthetic data rephrasing to enhance quality
Reduced heuristic filters for more data retention
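To make the classifier-ensembling idea above concrete, here is a minimal, hypothetical sketch (not the paper's actual pipeline): per-document scores from several quality classifiers are combined by taking the maximum, so a document rated highly by any one classifier is retained, and documents are then assigned to coarse quality buckets. The function names, score values, and bucket edges are illustrative assumptions.

```python
# Hypothetical sketch of classifier ensembling for data quality scoring.
# Taking the per-document maximum keeps any document that at least one
# classifier rates highly, which retains more data than requiring all
# classifiers to agree.

def ensemble_scores(score_lists):
    """Combine per-document scores from multiple classifiers via max."""
    return [max(scores) for scores in zip(*score_lists)]

def bucket(score, edges=(0.25, 0.5, 0.75)):
    """Map a combined score to a quality bucket index (0 = lowest)."""
    return sum(score >= e for e in edges)

# Toy scores for four documents from two hypothetical classifiers,
# e.g. an educational-value classifier and a DCLM-style classifier.
clf1 = [0.1, 0.6, 0.3, 0.9]
clf2 = [0.2, 0.4, 0.8, 0.7]

combined = ensemble_scores([clf1, clf2])   # [0.2, 0.6, 0.8, 0.9]
buckets = [bucket(s) for s in combined]    # [0, 2, 3, 3]
```

Bucketing rather than hard filtering lets downstream training weight high-quality data more heavily without discarding the bulk of the corpus outright.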