🤖 AI Summary
A large-scale, high-quality, open-source corpus for mathematical LLM pre-training has been missing. Method: The authors introduce MegaMath, an open mathematical pre-training corpus of 371 billion tokens, built from three collaborating sources: re-extracted web data, math-related code, and controllable synthesis. The pipeline combines math-oriented HTML extraction of structured mathematical content, fastText-based content filtering, fingerprint-based deduplication, filtering of math-related code from Stack-V2, and web- and code-driven QA generation alongside text-code hybrid synthesis. Contribution/Results: MegaMath is the largest and highest-quality open mathematical pre-training corpus to date. Models pre-trained on it achieve substantial gains on major benchmarks, including GSM8K, MATH, and AMC, consistently outperforming models trained only on existing open mathematical data.
📝 Abstract
Mathematical reasoning is a cornerstone of human intelligence and a key benchmark for advanced capabilities in large language models (LLMs). However, the research community still lacks an open, large-scale, high-quality corpus tailored to the demands of math-centric LLM pre-training. We present MegaMath, an open dataset curated from diverse, math-focused sources through the following practices: (1) Revisiting web data: We re-extracted mathematical documents from Common Crawl using math-oriented HTML optimizations, fastText-based filtering, and deduplication, to acquire higher-quality data from the Internet. (2) Recalling math-related code data: We identified high-quality, math-related code from Stack-V2, a large code training corpus, further enhancing data diversity. (3) Exploring synthetic data: We synthesized QA-style text, math-related code, and interleaved text-code blocks from the web and code data. By integrating these strategies and validating their effectiveness through extensive ablations, MegaMath delivers 371B tokens, offering the largest quantity and top quality among existing open math pre-training datasets.
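The filter-then-deduplicate stage of such a web pipeline can be sketched in a few lines. This is only an illustrative outline, not the paper's implementation: the `math_score` heuristic below is a hypothetical keyword scorer standing in for the trained fastText classifier, and the exact-hash `fingerprint` stands in for the paper's fingerprint-based deduplication.

```python
import hashlib
import re

# Stand-in for a trained fastText classifier: a hypothetical keyword
# scorer that rates how mathematical a document looks. The real pipeline
# uses a learned model; this heuristic only illustrates the interface.
MATH_HINTS = re.compile(r"\\frac|\\int|theorem|equation|\$[^$]+\$|\bproof\b", re.I)

def math_score(doc: str) -> float:
    hits = len(MATH_HINTS.findall(doc))
    return min(1.0, hits / 5.0)  # crude confidence in [0, 1]

def fingerprint(doc: str) -> str:
    # Normalize case and whitespace before hashing so trivial variants collide.
    normalized = " ".join(doc.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def filter_and_dedup(docs, threshold=0.4):
    seen, kept = set(), []
    for doc in docs:
        if math_score(doc) < threshold:
            continue  # drop documents scored as non-mathematical
        fp = fingerprint(doc)
        if fp in seen:
            continue  # drop near-verbatim duplicates
        seen.add(fp)
        kept.append(doc)
    return kept

docs = [
    "Proof: by the theorem above, $x^2 + 1 > 0$ for all real $x$. Equation (2) follows.",
    "PROOF: by the Theorem above, $x^2 + 1 > 0$ for all real $x$.  equation (2) follows.",
    "Top ten travel destinations for the summer season.",
]
print(len(filter_and_dedup(docs)))  # 1: the duplicate and the off-topic doc are removed
```

A production pipeline would replace the scorer with model inference over billions of documents and the exact hash with near-duplicate detection (e.g. MinHash), but the control flow stays the same: score, threshold, fingerprint, keep-first.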