AI Summary
This work addresses a limitation of existing reinforcement learning approaches to mathematical reasoning: they often neglect challenging problems and lack a systematic mechanism for progressive difficulty escalation, which constrains model performance on complex tasks. To overcome this, the authors propose MathForge, a framework that emphasizes high-difficulty problems from both the algorithmic and the data perspective. Algorithmically, they introduce Difficulty-Aware Group Policy Optimization (DGPO), which combines a difficulty-balanced advantage estimator with difficulty-aware question-level weighting. On the data side, they develop Multi-Aspect Question Reformulation (MQR), which enables controllable difficulty escalation while preserving answer consistency. Extensive experiments demonstrate that MathForge significantly outperforms existing methods across multiple mathematical reasoning benchmarks, validating the efficacy of a difficulty-centric training paradigm for enhancing large language models' reasoning capabilities.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) offers a robust mechanism for enhancing mathematical reasoning in large language models. However, we identify a systematic lack of emphasis on more challenging questions in existing methods from both algorithmic and data perspectives, despite their importance for refining underdeveloped capabilities. Algorithmically, the widely used Group Relative Policy Optimization (GRPO) suffers from an implicit imbalance in which the magnitude of policy updates is lower for harder questions. Data-wise, augmentation approaches primarily rephrase questions to enhance diversity without systematically increasing intrinsic difficulty. To address these issues, we propose MathForge, a dual framework that improves mathematical reasoning by targeting harder questions from both perspectives; it comprises a Difficulty-Aware Group Policy Optimization (DGPO) algorithm and a Multi-Aspect Question Reformulation (MQR) strategy. Specifically, DGPO first rectifies the implicit imbalance in GRPO via difficulty-balanced group advantage estimation, and further prioritizes harder questions through difficulty-aware question-level weighting. Meanwhile, MQR reformulates questions along multiple aspects to increase difficulty while preserving the original gold answer. Overall, MathForge forms a synergistic loop: MQR expands the data frontier, and DGPO learns effectively from the augmented data. Extensive experiments show that MathForge significantly outperforms existing methods on various mathematical reasoning tasks. The code and augmented data are available at https://github.com/AMAP-ML/MathForge.
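To make the algorithmic idea concrete, below is a minimal sketch of a difficulty-balanced, difficulty-weighted group advantage estimator in the spirit of DGPO. This is not the paper's implementation: the failure-rate difficulty proxy, the `1 + alpha * difficulty` weighting, and the function names are illustrative assumptions; see the released code for the actual estimator.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Standard GRPO group-relative advantages: z-score of rewards within one rollout group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def dgpo_advantages(rewards: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Hypothetical difficulty-balanced variant (NOT the paper's exact estimator).

    Difficulty is proxied by the group's failure rate on 0/1 verifiable rewards; the
    group advantages are rescaled by a difficulty-aware question-level weight so that
    harder questions are not implicitly under-updated.
    """
    adv = grpo_advantages(rewards)
    difficulty = 1.0 - rewards.mean()      # failure rate of this question's rollouts
    weight = 1.0 + alpha * difficulty      # illustrative up-weighting of harder questions
    return weight * adv

# Example: a hard question where only 2 of 8 rollouts were verified correct.
rewards = np.array([1, 0, 0, 1, 0, 0, 0, 0], dtype=float)
print(dgpo_advantages(rewards))
```

The sketch only illustrates the shape of the correction described in the abstract: normalize rewards within a group as in GRPO, then apply a question-level weight that grows with estimated difficulty.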