🤖 AI Summary
Current AI systems show limited capability in complex mathematical reasoning, largely due to the scarcity of high-quality training data: existing datasets are small, insufficiently challenging, contaminated by benchmark leakage, and lacking automatically verifiable solutions.
Method: We introduce DeepMath-103K, a large-scale, decontaminated, high-difficulty (Levels 5–9) mathematical reasoning dataset comprising approximately 103K problems. Each problem includes a verifiable final answer and three distinct R1-generated solutions. The dataset is curated through a rigorous benchmark decontamination pipeline (incorporating source analysis and contamination detection), difficulty-aware filtering, and a rule-based answer verification framework, enabling supervised fine-tuning, reinforcement learning, and knowledge distillation.
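The paper does not spell out its answer verification rules here, but the core idea of rule-based checking against a verifiable final answer (as used for RL reward signals) can be sketched roughly as follows; the normalization steps and numeric-equivalence fallback are illustrative assumptions, not the authors' actual implementation:

```python
from fractions import Fraction

def normalize(ans: str) -> str:
    """Strip whitespace, surrounding $...$ markers, and trailing periods."""
    ans = ans.strip().strip("$").strip()
    return ans.rstrip(".")

def answers_match(predicted: str, reference: str) -> bool:
    """Rule-based check: exact match after normalization, falling back
    to numeric equivalence (e.g. '0.5' vs '1/2')."""
    p, r = normalize(predicted), normalize(reference)
    if p == r:
        return True
    try:
        return Fraction(p) == Fraction(r)
    except (ValueError, ZeroDivisionError):
        return False

print(answers_match(" 1/2 ", "0.5"))  # numeric equivalence
print(answers_match("$42$", "42"))    # formatting differences
```

In an RL loop, such a checker would assign a binary reward to each sampled solution based solely on its final answer, with no human grading required.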
Contribution/Results: Models trained on DeepMath-103K achieve substantial gains on challenging benchmarks, including AMC23, AIME, and MATH, demonstrating measurable advances in mathematical reasoning. The dataset is publicly released to foster community progress.
📝 Abstract
The capacity for complex mathematical reasoning is a key benchmark for artificial intelligence. While reinforcement learning (RL) applied to LLMs shows promise, progress is significantly hindered by the lack of large-scale training data that is sufficiently challenging, possesses verifiable answer formats suitable for RL, and is free from contamination with evaluation benchmarks. To address these limitations, we introduce DeepMath-103K, a new, large-scale dataset comprising approximately 103K mathematical problems, specifically designed to train advanced reasoning models via RL. DeepMath-103K is curated through a rigorous pipeline involving source analysis, stringent decontamination against numerous benchmarks, and filtering for high difficulty (primarily Levels 5-9), significantly exceeding existing open resources in challenge. Each problem includes a verifiable final answer, enabling rule-based RL, and three distinct R1-generated solutions suitable for diverse training paradigms like supervised fine-tuning or distillation. Spanning a wide range of mathematical topics, DeepMath-103K promotes the development of generalizable reasoning. We demonstrate that models trained on DeepMath-103K achieve significant improvements on challenging mathematical benchmarks, validating its effectiveness. We release DeepMath-103K publicly to facilitate community progress in building more capable AI reasoning systems: https://github.com/zwhe99/DeepMath.
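The abstract's "stringent decontamination against numerous benchmarks" is often implemented via n-gram overlap between candidate problems and benchmark test sets. A minimal sketch of that idea follows; the n-gram size and overlap threshold are hypothetical choices for illustration, not the paper's reported settings:

```python
def ngrams(text: str, n: int = 10) -> set:
    """Lowercased word n-grams, a common unit for contamination checks."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(candidate: str, benchmark_problems: list,
                    n: int = 10, threshold: float = 0.5) -> bool:
    """Flag a candidate problem if a large fraction of its n-grams
    also appears in any benchmark problem."""
    cand = ngrams(candidate, n)
    if not cand:
        return False
    for bench in benchmark_problems:
        overlap = len(cand & ngrams(bench, n))
        if overlap / len(cand) >= threshold:
            return True
    return False
```

Problems flagged by a filter like this would be dropped before training, so that benchmark gains reflect reasoning ability rather than memorized test items.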