🤖 AI Summary
Large language models (LLMs) exhibit weak chain-of-thought (CoT) reasoning on complex mathematics, rely heavily on external tools, and lack explicit supervision of their reasoning processes.
Method: We propose Reinforcement Learning from Evol-Instruct Feedback (RLEIF), a framework that jointly optimizes instruction evolution and stepwise reasoning supervision. Built upon open-source base models (e.g., Mistral, Gemma), RLEIF integrates evolutionary instruction tuning, process-supervised reinforcement learning, and distillation of high-quality mathematical CoT data to enable fully language-based, end-to-end mathematical problem solving.
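The process-supervised reward described above can be sketched as combining an instruction-quality score with per-step reasoning scores. This is a minimal, hypothetical illustration: the two scoring functions are toy stand-ins for learned reward models (the paper trains an instruction reward model and a process reward model), and the multiplicative aggregation is one plausible choice, not the paper's exact formula.

```python
# Hedged sketch of process-supervised reward aggregation.
# instruction_reward and step_rewards are toy proxies for learned
# reward models (IRM / PRM); real systems would use trained scorers.

def instruction_reward(instruction: str) -> float:
    """Toy stand-in: score instruction quality in [0, 1]."""
    return min(1.0, len(instruction.split()) / 20)

def step_rewards(steps: list[str]) -> list[float]:
    """Toy stand-in: score each CoT step's correctness in [0, 1]."""
    return [0.9 if s.strip() else 0.0 for s in steps]

def combined_reward(instruction: str, steps: list[str]) -> float:
    # Multiply the instruction score by the product of per-step
    # scores, so a single bad reasoning step drags the reward down.
    r = instruction_reward(instruction)
    for r_step in step_rewards(steps):
        r *= r_step
    return r

reward = combined_reward(
    "Solve for x: 2x + 3 = 11. Show each step.",
    ["Subtract 3 from both sides: 2x = 8.", "Divide by 2: x = 4."],
)
```

In an RL loop, this scalar would serve as the return signal for a policy-gradient update (e.g., PPO) on the generator model.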
Contribution/Results: The resulting WizardMath-7B achieves state-of-the-art performance among open-source models of comparable size: 92.3% on GSM8K and 52.4% on MATH. Its 70B variant surpasses GPT-3.5-Turbo, Claude 2, and an early version of GPT-4, empirically validating the effectiveness and scalability of process-oriented reasoning optimization.
📝 Abstract
Large language models (LLMs), such as GPT-4, have shown remarkable performance in natural language processing (NLP) tasks, including challenging mathematical reasoning. However, most existing open-source models are only pre-trained on large-scale internet data, without math-related optimization. In this paper, we present WizardMath, which enhances the mathematical CoT reasoning abilities of LLMs without using external Python tools, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math. Through extensive experiments on two mathematical reasoning benchmarks, namely GSM8K and MATH, we reveal the extraordinary capabilities of our model. Remarkably, WizardMath-Mistral 7B surpasses top-tier open-source LLMs by a substantial margin with higher data efficiency. Furthermore, WizardMath 70B even outperforms GPT-3.5-Turbo, Claude 2, Gemini Pro, and an early version of GPT-4. Additionally, our preliminary exploration highlights the pivotal role of instruction evolution and process supervision in achieving exceptional math performance. For more details, refer to https://github.com/nlpxucan/WizardLM