🤖 AI Summary
Small language models (SLMs) face inherent limitations in complex mathematical reasoning, and conventional approaches rely on large-model distillation or costly human annotation.
Method: This work introduces an MCTS-driven test-time deep reasoning framework enabling SLMs to autonomously enhance mathematical capabilities without external supervision. It features three core innovations: (i) code-augmented chain-of-thought (CoT) data synthesis, (ii) unsupervised process reward modeling, and (iii) end-to-end co-evolution of a policy model and a process preference model (PPM).
Contribution/Results: On the MATH benchmark, Qwen2.5-Math-7B achieves 90.0% (+31.2 points absolute) and Phi3-mini-3.8B reaches 86.4% (+45.0), both surpassing o1-preview. On AIME, the models solve 8/15 problems (53.3%), matching top high-school competitors. This is the first demonstration of an autonomous, evolutionary capability leap in complex mathematical reasoning achieved purely with small-scale models, without large-model assistance or human-labeled data.
📝 Abstract
We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.
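To make the "deep thinking" loop concrete, below is a minimal, generic MCTS sketch over candidate reasoning steps. It is not the paper's implementation: `propose_steps` is a hypothetical stand-in for the policy SLM's step generator, `score_trajectory` for the PPM's preference score, and all function names and parameters are illustrative assumptions.

```python
import math
import random

class Node:
    """One reasoning step in the search tree."""
    def __init__(self, step, parent=None):
        self.step = step          # text of this reasoning step (None at the root)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0          # sum of backed-up rewards

    def uct(self, c_puct=1.4):
        """Upper-confidence score balancing exploitation and exploration."""
        if self.visits == 0:
            return float("inf")   # visit unexplored children first
        exploit = self.value / self.visits
        explore = c_puct * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts(propose_steps, score_trajectory, n_rollouts=64, max_depth=3, seed=0):
    """MCTS over reasoning steps (illustrative sketch, not rStar-Math's exact recipe).

    propose_steps(path)    -> list of candidate next steps (stand-in for the policy SLM)
    score_trajectory(path) -> reward in [0, 1]             (stand-in for the PPM)
    """
    rng = random.Random(seed)
    root = Node(step=None)
    for _ in range(n_rollouts):
        # 1. Selection: descend by UCT until reaching an unexpanded node.
        node, path = root, []
        while node.children:
            node = max(node.children, key=Node.uct)
            path.append(node.step)
        # 2. Expansion: ask the policy for candidate next steps.
        if len(path) < max_depth:
            for step in propose_steps(path):
                node.children.append(Node(step, parent=node))
            if node.children:
                node = rng.choice(node.children)
                path.append(node.step)
        # 3. Evaluation: score the (partial) trajectory with the reward model.
        reward = score_trajectory(path)
        # 4. Backpropagation: propagate the reward up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Return the most-visited chain of steps as the selected trajectory.
    best, trajectory = root, []
    while best.children:
        best = max(best.children, key=lambda n: n.visits)
        trajectory.append(best.step)
    return trajectory
```

The same search serves two roles described in the abstract: at data-synthesis time the rollouts yield step-by-step verified trajectories for training, and at test time the reward model steers the search toward correct solutions.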