🤖 AI Summary
Existing mathematical data augmentation methods overlook LLMs’ specific failure modes, producing synthetic problems that lack diagnostic precision and deliver only marginal performance gains. To address this, we propose a defect-aware collaborative training framework: first, we perform fine-grained failure analysis to identify model weaknesses in mathematical reasoning; second, multiple expert LLMs collaboratively generate, critique, and iteratively refine high-difficulty, weakness-targeted problems; third, we apply progressive fine-tuning to strengthen deficient capabilities. Evaluated on six mainstream mathematical benchmarks, our method achieves an average 12.57% improvement over strong baselines and sets a new state of the art. Our core contribution lies in unifying failure diagnostics, defect-driven data synthesis, and progressive learning into a single, interpretable, and scalable paradigm for enhancing LLMs’ mathematical reasoning capabilities.
📝 Abstract
Large Language Models (LLMs) excel at solving mathematical problems, yet their performance is often limited by the availability of high-quality, diverse training data. Existing methods focus on augmenting datasets through rephrasing or difficulty progression but overlook the specific failure modes of LLMs. This results in synthetic questions that the model can already solve, providing minimal performance gains. To address this, we propose WarriorMath, a defect-aware framework for mathematical problem solving that integrates both targeted data synthesis and progressive training. In the synthesis stage, we employ multiple expert LLMs in a collaborative process to generate, critique, and refine problems. Questions that base LLMs fail to solve are identified and iteratively improved through expert-level feedback, producing high-quality, defect-aware training data. In the training stage, we introduce a progressive learning framework that iteratively fine-tunes the model using increasingly challenging data tailored to its weaknesses. Experiments on six mathematical benchmarks show that WarriorMath outperforms strong baselines by 12.57% on average, setting a new state of the art. Our results demonstrate the effectiveness of a defect-aware, multi-expert framework for improving LLMs’ mathematical reasoning ability.
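The synthesis stage described above can be sketched as a filter-and-refine loop: candidate problems the base model already solves are discarded, and the rest are passed through expert critique rounds before being kept as training data. The sketch below is illustrative only, assuming stub functions (`solve`, `critique`, `refine`) in place of real LLM calls; none of these names come from the paper.

```python
# Hypothetical sketch of a defect-aware synthesis loop in the spirit of
# WarriorMath. solve/critique/refine are stubs standing in for LLM queries.

def solve(base_model, problem):
    # Stub: the "model" solves a problem iff its core text is in a known set.
    core = problem.split(" [", 1)[0]
    return core in base_model["solvable"]

def critique(expert, problem):
    # Stub: an expert returns feedback pinpointing the weak reasoning step.
    return f"{expert}: tighten the hardest step"

def refine(problem, feedback):
    # Stub: refinement annotates the problem with the expert's feedback.
    return f"{problem} [{feedback}]"

def synthesize_defect_aware(base_model, experts, candidates):
    """Keep only problems the base model fails, refined by each expert in turn."""
    training_data = []
    for q in candidates:
        if solve(base_model, q):
            continue  # already solvable -> little diagnostic value, discard
        for expert in experts:  # collaborative critique-and-refine rounds
            q = refine(q, critique(expert, q))
        training_data.append(q)
    return training_data

base = {"solvable": {"Compute 2+2."}}
data = synthesize_defect_aware(
    base, ["expertA", "expertB"],
    ["Compute 2+2.", "Prove the AM-GM inequality."],
)
# Only the failed problem survives, carrying both experts' refinements.
```

In a real pipeline each stub would be an LLM call, and the kept problems would feed the progressive training stage, ordered from easier to harder.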