🤖 AI Summary
Current AI systems show limited capability in complex mathematical reasoning, largely due to the scarcity of high-quality training data: existing datasets are small, insufficiently challenging, contaminated by benchmark leakage, and lacking automatically verifiable solutions.
Method: We introduce DeepMath-103K, a large-scale, decontaminated, high-difficulty (Levels 5–9) mathematical reasoning dataset comprising approximately 103K problems. Each problem includes a verifiable final answer and three distinct R1-generated solutions. The dataset is curated through a rigorous benchmark decontamination pipeline (incorporating source analysis and contamination detection), difficulty-aware filtering, and a rule-based answer verification framework, enabling supervised fine-tuning, reinforcement learning, and knowledge distillation.
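The paper does not spell out its answer verification rules here, but the core idea of rule-based checking against a verifiable final answer (as used for RL reward signals) can be sketched roughly as follows; the normalization steps and numeric-equivalence fallback are illustrative assumptions, not the authors' actual implementation:

```python
from fractions import Fraction

def normalize(ans: str) -> str:
    """Strip whitespace, surrounding $...$ markers, and trailing periods."""
    ans = ans.strip().strip("$").strip()
    return ans.rstrip(".")

def answers_match(predicted: str, reference: str) -> bool:
    """Rule-based check: exact match after normalization, falling back
    to numeric equivalence (e.g. '0.5' vs '1/2')."""
    p, r = normalize(predicted), normalize(reference)
    if p == r:
        return True
    try:
        return Fraction(p) == Fraction(r)
    except (ValueError, ZeroDivisionError):
        return False

print(answers_match(" 1/2 ", "0.5"))  # numeric equivalence
print(answers_match("$42$", "42"))    # formatting differences
```

In an RL loop, such a checker would assign a binary reward to each sampled solution based solely on its final answer, with no human grading required.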
Contribution/Results: Models trained on DeepMath-103K achieve substantial gains on challenging benchmarks, including AMC23, AIME, and MATH, demonstrating measurable advances in mathematical reasoning. The dataset is publicly released to foster community progress.
📝 Abstract
The capacity for complex mathematical reasoning is a key benchmark for artificial intelligence. While reinforcement learning (RL) applied to LLMs shows promise, progress is significantly hindered by the lack of large-scale training data that is sufficiently challenging, possesses verifiable answer formats suitable for RL, and is free from contamination with evaluation benchmarks. To address these limitations, we introduce DeepMath-103K, a new, large-scale dataset comprising approximately 103K mathematical problems, specifically designed to train advanced reasoning models via RL. DeepMath-103K is curated through a rigorous pipeline involving source analysis, stringent decontamination against numerous benchmarks, and filtering for high difficulty (primarily Levels 5-9), significantly exceeding existing open resources in challenge. Each problem includes a verifiable final answer, enabling rule-based RL, and three distinct R1-generated solutions suitable for diverse training paradigms like supervised fine-tuning or distillation. Spanning a wide range of mathematical topics, DeepMath-103K promotes the development of generalizable reasoning. We demonstrate that models trained on DeepMath-103K achieve significant improvements on challenging mathematical benchmarks, validating its effectiveness. We release DeepMath-103K publicly to facilitate community progress in building more capable AI reasoning systems: https://github.com/zwhe99/DeepMath.
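The abstract's "stringent decontamination against numerous benchmarks" is often implemented via n-gram overlap between candidate problems and benchmark test sets. A minimal sketch of that idea follows; the n-gram size and overlap threshold are hypothetical choices for illustration, not the paper's reported settings:

```python
def ngrams(text: str, n: int = 10) -> set:
    """Lowercased word n-grams, a common unit for contamination checks."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(candidate: str, benchmark_problems: list,
                    n: int = 10, threshold: float = 0.5) -> bool:
    """Flag a candidate problem if a large fraction of its n-grams
    also appears in any benchmark problem."""
    cand = ngrams(candidate, n)
    if not cand:
        return False
    for bench in benchmark_problems:
        overlap = len(cand & ngrams(bench, n))
        if overlap / len(cand) >= threshold:
            return True
    return False
```

Problems flagged by a filter like this would be dropped before training, so that benchmark gains reflect reasoning ability rather than memorized test items.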