DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training

📅 2025-04-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current LLM reasoning training suffers from unquantified data quality and empirically designed learning rate schedules lacking theoretical grounding. Method: We construct a large-scale dataset comprising 3.34 million difficulty-graded reasoning queries and 40 million cross-model distillation responses; identify and empirically validate the critical phenomenon that reasoning-oriented fine-tuning requires substantially higher learning rates; and propose a “difficulty–quality co-calibration” data filtering paradigm jointly leveraging pass rate and coefficient of variation (CV) to enhance interpretability and efficacy of distilled data. Contribution/Results: Our approach achieves a 79.2% pass rate on the AIME2024 benchmark—surpassing leading distillation-based models and approaching state-of-the-art performance. All data, code, and methodology are publicly released.

📝 Abstract
Although large language models (LLMs) have recently achieved remarkable performance on various complex reasoning benchmarks, the academic community still lacks an in-depth understanding of base model training processes and data quality. To address this, we construct a large-scale, difficulty-graded reasoning dataset containing approximately 3.34 million unique queries of varying difficulty levels and about 40 million distilled responses generated by multiple models over several passes. Leveraging pass rate and Coefficient of Variation (CV), we precisely select the most valuable training data to enhance reasoning capability. Notably, we observe a training pattern shift, indicating that reasoning-focused training based on base models requires higher learning rates for effective training. Using this carefully selected data, we significantly improve the reasoning capabilities of the base model, achieving a pass rate of 79.2% on the AIME2024 mathematical reasoning benchmark. This result surpasses most current distilled models and closely approaches state-of-the-art performance. We provide detailed descriptions of our data processing, difficulty assessment, and training methodology, and have publicly released all datasets and methods to promote rapid progress in open-source long-reasoning LLMs. The dataset is available at: https://huggingface.co/datasets/a-m-team/AM-DeepSeek-Distilled-40M
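The abstract's selection criterion can be sketched concretely: for each query, correctness flags are collected across multiple distilled responses, a pass rate and coefficient of variation (CV) are computed, and only queries inside chosen bands are kept. The threshold values below are illustrative assumptions, not the paper's actual settings:

```python
import statistics

def query_stats(scores):
    """Pass rate and coefficient of variation (CV) for one query,
    given 0/1 correctness flags over its distilled responses."""
    pr = statistics.mean(scores)
    # CV = population std dev / mean; undefined when nothing passes.
    cv = statistics.pstdev(scores) / pr if pr > 0 else float("inf")
    return pr, cv

def select_training_queries(all_scores, pr_band=(0.1, 0.9), cv_max=1.0):
    """Keep queries whose difficulty (pass rate) and response
    consistency (CV) fall inside illustrative bands.

    all_scores: {query_id: [0/1 flags, one per distilled response]}
    pr_band, cv_max: hypothetical thresholds for this sketch.
    """
    selected = []
    for qid, scores in all_scores.items():
        pr, cv = query_stats(scores)
        if pr_band[0] <= pr <= pr_band[1] and cv <= cv_max:
            selected.append(qid)
    return selected
```

Intuitively, the pass-rate band excludes queries that are trivially easy or never solved, while the CV cap filters out queries whose responses are too inconsistent to be reliable training signal.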
Problem

Research questions and friction points this paper is trying to address.

How to enhance LLM reasoning through training on difficulty-graded data
How to select the most valuable training data using pass rate and CV
How to raise base-model reasoning toward state-of-the-art performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale difficulty-graded reasoning dataset construction
Precise data selection using pass rate and CV
Higher learning rates for reasoning-focused training
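The first innovation, grading queries by difficulty, can be illustrated by bucketing each query on its cross-model pass rate. The tier names and boundaries here are assumptions for the sketch, not the paper's actual grading scheme:

```python
def difficulty_tier(pass_rate):
    """Map a query's cross-model pass rate to an illustrative
    difficulty tier (boundaries are hypothetical)."""
    if pass_rate >= 0.8:
        return "easy"
    if pass_rate >= 0.4:
        return "medium"
    if pass_rate > 0.0:
        return "hard"
    return "unsolved"
```

Grading this way lets the training mix be controlled explicitly, e.g. oversampling "hard" queries that models solve only occasionally.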