🤖 AI Summary
Existing approaches to synthetic mathematical problem generation face high computational costs, complex prompting requirements, and limited control over problem difficulty. Method: We propose ScaleDiff, a three-stage pipeline comprising (1) a lightweight adaptive thinking model that identifies difficult problems with a single forward pass by automatically switching between "Thinking" and "NoThinking" modes; (2) a compact, specialized generator, DiffGen-8B, trained on this filtered difficult data for low-cost, large-scale synthesis of new difficult problems; and (3) fine-tuning Qwen2.5-Math-7B-Instruct on the resulting ScaleDiff-Math dataset. The pipeline eliminates complex per-instance prompting and sharply reduces API and training overhead. Contribution/Results: Fine-tuning on ScaleDiff-Math yields an 11.3% improvement over the original dataset and a 65.9% average accuracy across AIME'24, AIME'25, HMMT-Feb'25, BRUMO'25, and MATH500, outperforming strong recent LRMs such as OpenThinker3. The authors also observe a clear scaling trend: performance on difficult benchmarks improves as the quantity of difficult training problems grows, suggesting an efficient, data-driven path for advancing mathematical reasoning models.
📝 Abstract
Large Reasoning Models (LRMs) have shown impressive capabilities in complex problem-solving, often benefiting from training on difficult mathematical problems that stimulate intricate reasoning. Recent efforts have explored automated synthesis of mathematical problems by prompting proprietary models or large-scale open-source models from seed data or inherent mathematical concepts. However, scaling up these methods remains challenging due to their high computational/API costs, complex prompting, and the limited difficulty of the generated problems. To overcome these limitations, we propose ScaleDiff, a simple yet effective pipeline designed to scale the creation of difficult problems. We efficiently identify difficult problems from existing datasets with only a single forward pass using an adaptive thinking model, which can perceive problem difficulty and automatically switch between "Thinking" and "NoThinking" modes. We then train a specialized difficult problem generator (DiffGen-8B) on this filtered difficult data, which can produce new difficult problems at large scale, eliminating the need for complex, per-instance prompting and its associated high API costs. Fine-tuning Qwen2.5-Math-7B-Instruct on the ScaleDiff-Math dataset yields a substantial performance increase of 11.3% compared to the original dataset and achieves a 65.9% average accuracy on AIME'24, AIME'25, HMMT-Feb'25, BRUMO'25, and MATH500, outperforming recent strong LRMs like OpenThinker3. Notably, this performance is achieved using the cost-efficient Qwen3-8B model as a teacher, demonstrating that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models. Furthermore, we observe a clear scaling phenomenon in model performance on difficult benchmarks as the quantity of difficult problems increases. Code: https://github.com/QizhiPei/ScaleDiff.
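The difficulty-filtering idea in the abstract can be illustrated with a minimal sketch. It assumes the adaptive thinking model emits an explicit marker (here, a hypothetical `<think>` block) only when it judges a problem hard enough to warrant its "Thinking" mode; the `fake_generate` stand-in and the marker convention are illustrative assumptions, not the paper's actual implementation.

```python
def is_difficult(problem: str, generate) -> bool:
    """Label a problem as difficult if the adaptive model switches
    into its "Thinking" mode, detected via an assumed mode marker."""
    output = generate(problem)
    return "<think>" in output  # hypothetical Thinking-mode marker


def fake_generate(problem: str) -> str:
    """Toy stand-in for the adaptive thinking model: pretend that
    longer problem statements trigger the Thinking mode."""
    if len(problem) > 40:
        return "<think>step-by-step reasoning...</think> answer"
    return "answer"


easy = "What is 2 + 2?"
hard = "Prove there are infinitely many primes p with p = 1 mod 4."

# Single forward pass per problem; keep only the difficult ones.
difficult_pool = [p for p in (easy, hard) if is_difficult(p, fake_generate)]
```

In the real pipeline this filter runs once over an existing dataset, and the retained problems become training data for the DiffGen-8B generator.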