Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

📅 2025-08-20
🤖 AI Summary
Existing mathematical pretraining datasets suffer from low quality due to fragile heuristic extraction, lossy HTML parsing, and structural degradation of mathematical content. This work introduces the first domain-agnostic, robust scientific text extraction pipeline, integrating layout-aware rendering (via lynx), LLM-driven cleaning, and precise HTML-to-text conversion to enable structured extraction of mathematical formulas and code blocks, alongside LaTeX standardization. Leveraging this pipeline, the authors construct two high-quality, open-source datasets: Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens), surpassing existing open math datasets in both scale and fidelity. Pretraining evaluations show substantial improvements (up to +12.6 on MATH and +14.3 on MBPP+) while also enhancing performance on general-purpose benchmarks such as MMLU. These datasets establish a reliable, scalable foundation for training advanced mathematical language models.
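The format-recovery idea in the summary can be illustrated with a small sketch. The code below is a hypothetical simplification, not the released Nemotron-CC-Math pipeline: it finds MathJax-style math embedded in raw HTML (both `<script type="math/tex">` blocks and `\(...\)` / `\[...\]` delimiters) and rewrites it into uniform LaTeX delimiters so a later text dump cannot destroy the equations.

```python
import re

# Hypothetical sketch: normalize MathJax-style math in raw HTML into
# plain $...$ / $$...$$ LaTeX before the page is converted to text.

# <script type="math/tex; mode=display">...</script> (display equations)
SCRIPT_DISPLAY = re.compile(
    r'<script[^>]*type="math/tex;\s*mode=display"[^>]*>(.*?)</script>', re.DOTALL
)
# <script type="math/tex">...</script> (inline equations)
SCRIPT_INLINE = re.compile(
    r'<script[^>]*type="math/tex"[^>]*>(.*?)</script>', re.DOTALL
)
# \[...\] and \(...\) delimiter forms
DELIM_DISPLAY = re.compile(r"\\\[(.+?)\\\]", re.DOTALL)
DELIM_INLINE = re.compile(r"\\\((.+?)\\\)", re.DOTALL)


def normalize_math(html: str) -> str:
    """Rewrite recognized math spans into uniform LaTeX delimiters."""
    # Handle display-mode scripts first so the inline pattern cannot claim them.
    html = SCRIPT_DISPLAY.sub(lambda m: "$$" + m.group(1).strip() + "$$", html)
    html = SCRIPT_INLINE.sub(lambda m: "$" + m.group(1).strip() + "$", html)
    html = DELIM_DISPLAY.sub(lambda m: "$$" + m.group(1).strip() + "$$", html)
    html = DELIM_INLINE.sub(lambda m: "$" + m.group(1).strip() + "$", html)
    return html
```

A real pipeline would also need to handle KaTeX annotation spans and MathML trees, which require a proper HTML parser rather than regular expressions; this sketch only shows why normalizing to LaTeX before text extraction preserves equation structure.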

📝 Abstract
Pretraining large language models (LLMs) on high-quality, structured data such as mathematics and code substantially enhances reasoning capabilities. However, existing math-focused datasets built from Common Crawl suffer from degraded quality due to brittle extraction heuristics, lossy HTML-to-text conversion, and the failure to reliably preserve mathematical structure. In this work, we introduce Nemotron-CC-Math, a large-scale, high-quality mathematical corpus constructed from Common Crawl using a novel, domain-agnostic pipeline specifically designed for robust scientific text extraction. Unlike previous efforts, our pipeline recovers math across various formats (e.g., MathJax, KaTeX, MathML) by leveraging layout-aware rendering with lynx and a targeted LLM-based cleaning stage. This approach preserves the structural integrity of equations and code blocks while removing boilerplate, standardizing notation into LaTeX representation, and correcting inconsistencies. We collected a large, high-quality math corpus, namely Nemotron-CC-Math-3+ (133B tokens) and Nemotron-CC-Math-4+ (52B tokens). Notably, Nemotron-CC-Math-4+ not only surpasses all prior open math datasets, including MegaMath, FineMath, and OpenWebMath, but also contains 5.5 times more tokens than FineMath-4+, which was previously the highest-quality math pretraining dataset. When used to pretrain a Nemotron-T 8B model, our corpus yields +4.8 to +12.6 gains on MATH and +4.6 to +14.3 gains on MBPP+ over strong baselines, while also improving general-domain performance on MMLU and MMLU-Stem. We present the first pipeline to reliably extract scientific content, including math, from noisy web-scale data, yielding measurable gains in math, code, and general reasoning, and setting a new state of the art among open math pretraining corpora. To support open-source efforts, we release our code and datasets.
Problem

Research questions and friction points this paper is trying to address.

Improving math pretraining data quality from Common Crawl
Preserving mathematical structure during web content extraction
Enhancing LLM reasoning with high-quality scientific text
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging layout-aware rendering for math extraction
Using LLM-based cleaning to preserve structural integrity
Standardizing diverse math formats into LaTeX representation
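The LLM-based cleaning stage named above can be sketched as a request builder. The prompt wording, model name, and request shape below are assumptions for illustration only; the paper does not publish its actual cleaning prompt.

```python
# Hypothetical sketch of a targeted LLM cleaning request: strip boilerplate
# from extracted text while keeping math and code intact. All names and the
# prompt text are placeholders, not the authors' pipeline.

CLEANING_PROMPT = """\
You are cleaning text extracted from a web page for a math pretraining corpus.
- Remove navigation menus, ads, and other boilerplate.
- Keep all mathematical content; render every equation in standard LaTeX.
- Keep code blocks verbatim.
- Do not paraphrase, summarize, or add text.

Text:
{document}
"""


def build_cleaning_request(document: str, model: str = "cleaning-model"):
    """Assemble a chat-style request dict for the cleaning model."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": CLEANING_PROMPT.format(document=document)}
        ],
        "temperature": 0.0,  # deterministic cleaning, no creative rewriting
    }
```

The key design point reflected here is that the LLM is used narrowly, for boilerplate removal and notation standardization, rather than being asked to rewrite or summarize the document, which would risk corrupting the training signal.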