Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

📅 2025-05-05
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current large language models (LLMs) exhibit performance bottlenecks in program synthesis and mathematical reasoning, largely due to the limited quality of their pre-training corpora. Method: The authors propose a "transform-and-retain" paradigm for constructing high-quality, domain-specific pre-training datasets, SwallowCode and SwallowMath. SwallowCode is built through a four-stage pipeline: syntax validation, pylint-based style filtering, and a two-stage LLM rewriting process that first enforces style conformity and then transforms snippets into self-contained, algorithmically efficient examples. SwallowMath is built by removing boilerplate, restoring missing context, and reformatting solutions into concise, step-by-step explanations. Unlike simple exclusionary filtering, this approach systematically upgrades low-quality code and mathematical text. Contribution/Results: The work provides an open, reproducible, end-to-end framework for domain-adaptive pre-training data enhancement. Continual pre-training of Llama-3.1-8B on these datasets yields a +17.0-point absolute improvement in pass@1 on HumanEval and a +12.4-point accuracy gain on GSM8K. All datasets, prompt templates, and model checkpoints are publicly released.
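The first pipeline stage, syntax validation, can be sketched with Python's standard `ast` module. This is a minimal illustration of the idea, not the authors' released code:

```python
import ast

def passes_syntax_check(snippet: str) -> bool:
    """Stage 1: keep only snippets that parse as syntactically valid Python."""
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:
        return False

# Valid code is retained; unparsable code is dropped before later stages.
snippets = [
    "def add(a, b):\n    return a + b\n",  # parses, so it is kept
    "def broken(:\n    pass\n",            # SyntaxError, so it is dropped
]
kept = [s for s in snippets if passes_syntax_check(s)]
```

Filtering on parseability first is cheap and removes snippets that later, more expensive stages could not process anyway.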

📝 Abstract
The performance of large language models (LLMs) in program synthesis and mathematical reasoning is fundamentally limited by the quality of their pre-training corpora. We introduce two openly licensed datasets, released under the Llama 3.3 Community License, that significantly enhance LLM performance by systematically rewriting public data. SwallowCode (approximately 16.1 billion tokens) refines Python snippets from The-Stack-v2 through a novel four-stage pipeline: syntax validation, pylint-based style filtering, and a two-stage LLM rewriting process that enforces style conformity and transforms snippets into self-contained, algorithmically efficient examples. Unlike prior methods that rely on exclusionary filtering or limited transformations, our transform-and-retain approach upgrades low-quality code, maximizing data utility. SwallowMath (approximately 2.3 billion tokens) enhances Finemath-4+ by removing boilerplate, restoring context, and reformatting solutions into concise, step-by-step explanations. Within a fixed 50 billion token training budget, continual pre-training of Llama-3.1-8B with SwallowCode boosts pass@1 by +17.0 on HumanEval and +17.7 on HumanEval+ compared to Stack-Edu, surpassing the baseline model's code generation capabilities. Similarly, substituting SwallowMath yields +12.4 accuracy on GSM8K and +7.6 on MATH. Ablation studies confirm that each pipeline stage contributes incrementally, with rewriting delivering the largest gains. All datasets, prompts, and checkpoints are publicly available, enabling reproducible research and advancing LLM pre-training for specialized domains.
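The pylint-based style filter (stage two) scores each snippet and discards those below a quality cutoff. The released pipeline is more involved, but the scoring step might look like the sketch below, where the 7.0 threshold is an illustrative assumption:

```python
def parse_pylint_score(report: str) -> float:
    """Extract the 0-10 quality score from a pylint text report."""
    for line in report.splitlines():
        if "rated at" in line:
            # e.g. "Your code has been rated at 8.50/10"
            return float(line.split("rated at")[1].split("/10")[0])
    return 0.0

def passes_style_filter(report: str, threshold: float = 7.0) -> bool:
    """Stage 2: keep snippets whose pylint score meets the threshold."""
    return parse_pylint_score(report) >= threshold

# In practice the report would come from running `pylint` on each snippet.
example_report = "Your code has been rated at 8.50/10 (previous run: 8.00/10)"
```

Static scoring like this only gates obviously low-quality code; the heavier lifting is left to the LLM rewriting stages that follow.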
Problem

Research questions and friction points this paper is trying to address.

Enhance LLM performance in math and code via data rewriting
Improve pre-training corpora quality for program synthesis and reasoning
Systematically transform low-quality code into efficient, self-contained examples
Innovation

Methods, ideas, or system contributions that make the work stand out.

Systematically rewrites public data for LLM enhancement
Uses four-stage pipeline for code quality improvement
Reformats math solutions into step-by-step explanations
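The two-stage LLM rewriting at the heart of the pipeline can be pictured as two sequential prompted passes. Below is a hedged skeleton: `llm` is a hypothetical stand-in for the actual model endpoint, and the prompt wording is illustrative, not taken from the paper's released templates:

```python
from typing import Callable

def two_stage_rewrite(snippet: str, llm: Callable[[str, str], str]) -> str:
    """Apply style-guided rewriting, then a self-contained/efficiency pass."""
    # Pass 1: enforce style conformity without changing behavior.
    styled = llm("Rewrite this Python snippet to follow standard style "
                 "conventions, preserving its behavior.", snippet)
    # Pass 2: make the snippet self-contained and algorithmically efficient.
    return llm("Rewrite this snippet so it is self-contained and "
               "algorithmically efficient.", styled)

def fake_llm(instruction: str, code: str) -> str:
    """Toy stand-in that merely records each pass, for demonstration."""
    return code + "\n# pass applied"
```

Separating the two passes mirrors the paper's design: style normalization first, so the second pass can focus on making each snippet independently runnable and efficient.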
Kazuki Fujii
Institute of Science Tokyo
Systems for Machine Learning
Yukito Tajima
Institute of Science Tokyo, Department of Computer Science
Sakae Mizuki
Hottolink, Inc. / Institute of Science Tokyo
machine learning, natural language processing, representation learning, computational statistics
Hinari Shimada
Institute of Science Tokyo, Department of Computer Science
Taihei Shiotani
Institute of Science Tokyo, Department of Computer Science
Koshiro Saito
Institute of Science Tokyo, Department of Computer Science
Masanari Ohi
Institute of Science Tokyo, Department of Computer Science
Masaki Kawamura
Institute of Science Tokyo, Department of Computer Science
Taishi Nakamura
Institute of Science Tokyo
artificial general intelligence, large language models, machine learning
Takumi Okamoto
Institute of Science Tokyo, Department of Computer Science
Shigeki Ishida
Institute of Science Tokyo, Department of Computer Science
Kakeru Hattori
Institute of Science Tokyo, Department of Computer Science; National Institute of Advanced Industrial Science and Technology
Youmi Ma
Institute of Science Tokyo
Information Extraction, Knowledge Acquisition, Natural Language Processing, Artificial Intelligence
Hiroya Takamura
National Institute of Advanced Industrial Science and Technology
Rio Yokota
Professor, Institute of Science Tokyo
high performance computing, large scale deep learning, hierarchical low-rank matrices, GPU computing
Naoaki Okazaki
Institute of Science Tokyo
natural language processing, artificial intelligence, machine learning