🤖 AI Summary
Existing approaches to synthetic mathematical problem generation face high computational costs, complex prompting requirements, and limited control over problem difficulty. Method: We propose ScaleDiff, a three-stage pipeline comprising (1) a lightweight adaptive thinking model that identifies difficult problems with a single forward pass by automatically switching between "Thinking" and "NoThinking" modes; (2) a compact, specialized generator, DiffGen-8B, trained on this filtered difficult data for low-cost, large-scale synthesis of new difficult problems; and (3) fine-tuning Qwen2.5-Math-7B-Instruct on the resulting ScaleDiff-Math dataset. The pipeline eliminates complex per-instance prompting and sharply reduces API and training overhead. Contribution/Results: Fine-tuning on ScaleDiff-Math yields an 11.3% improvement over the original dataset and a 65.9% average accuracy across AIME'24, AIME'25, HMMT-Feb'25, BRUMO'25, and MATH500, outperforming strong recent LRMs such as OpenThinker3. The authors also observe a clear scaling trend: performance on difficult benchmarks improves as the quantity of difficult training problems grows, suggesting an efficient, data-driven path for advancing mathematical reasoning models.
📝 Abstract
Large Reasoning Models (LRMs) have shown impressive capabilities in complex problem-solving, often benefiting from training on difficult mathematical problems that stimulate intricate reasoning. Recent efforts have explored automated synthesis of mathematical problems by prompting proprietary models or large-scale open-source models from seed data or inherent mathematical concepts. However, scaling up these methods remains challenging due to their high computational/API costs, complex prompting, and the limited difficulty of the generated problems. To overcome these limitations, we propose ScaleDiff, a simple yet effective pipeline designed to scale the creation of difficult problems. We efficiently identify difficult problems from existing datasets with only a single forward pass using an adaptive thinking model, which can perceive problem difficulty and automatically switch between "Thinking" and "NoThinking" modes. We then train a specialized difficult problem generator (DiffGen-8B) on this filtered difficult data, which can produce new difficult problems at large scale, eliminating the need for complex, per-instance prompting and its associated high API costs. Fine-tuning Qwen2.5-Math-7B-Instruct on the ScaleDiff-Math dataset yields a substantial performance increase of 11.3% compared to the original dataset and achieves a 65.9% average accuracy on AIME'24, AIME'25, HMMT-Feb'25, BRUMO'25, and MATH500, outperforming recent strong LRMs like OpenThinker3. Notably, this performance is achieved using the cost-efficient Qwen3-8B model as a teacher, demonstrating that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models. Furthermore, we observe a clear scaling phenomenon in model performance on difficult benchmarks as the quantity of difficult problems increases. Code: https://github.com/QizhiPei/ScaleDiff.
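The difficulty-filtering idea in the abstract can be illustrated with a minimal sketch. It assumes the adaptive thinking model emits an explicit marker (here, a hypothetical `<think>` block) only when it judges a problem hard enough to warrant its "Thinking" mode; the `fake_generate` stand-in and the marker convention are illustrative assumptions, not the paper's actual implementation.

```python
def is_difficult(problem: str, generate) -> bool:
    """Label a problem as difficult if the adaptive model switches
    into its "Thinking" mode, detected via an assumed mode marker."""
    output = generate(problem)
    return "<think>" in output  # hypothetical Thinking-mode marker


def fake_generate(problem: str) -> str:
    """Toy stand-in for the adaptive thinking model: pretend that
    longer problem statements trigger the Thinking mode."""
    if len(problem) > 40:
        return "<think>step-by-step reasoning...</think> answer"
    return "answer"


easy = "What is 2 + 2?"
hard = "Prove there are infinitely many primes p with p = 1 mod 4."

# Single forward pass per problem; keep only the difficult ones.
difficult_pool = [p for p in (easy, hard) if is_difficult(p, fake_generate)]
```

In the real pipeline this filter runs once over an existing dataset, and the retained problems become training data for the DiffGen-8B generator.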