MiroMind-M1: An Open-Source Advancement in Mathematical Reasoning via Context-Aware Multi-Stage Policy Optimization

📅 2025-07-19
🤖 AI Summary
Problem: Open-source mathematical reasoning models suffer from low transparency, poor reproducibility, and suboptimal performance. Method: We introduce MiroMind-M1, a fully open-source family of reasoning language models built on Qwen-2.5 and trained via a reproducible two-stage pipeline: (i) supervised fine-tuning (SFT) on 719K math problems augmented with verified chain-of-thought rationales, followed by (ii) reinforcement learning with verifiable rewards (RLVR) on a curated set of 62K high-difficulty problems. We propose Context-Aware Multi-Stage Policy Optimization, an algorithm that combines progressive sequence-length expansion with an adaptive repetition penalty to improve RL stability and token efficiency. Contribution/Results: MiroMind-M1 achieves state-of-the-art performance among open-source models of comparable scale (7B/32B) on the AIME24, AIME25, and MATH benchmarks. Crucially, we fully release the model weights, training data, and configuration scripts, enabling transparent, reproducible research on mathematical reasoning.
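The two ingredients of the proposed optimization algorithm can be sketched in code. The snippet below is a minimal illustration, not the paper's actual formulation: the stage lengths, the n-gram-based repetition measure, the `threshold` parameter, and the rule of zeroing the reward for over-length rollouts are all illustrative assumptions.

```python
from collections import Counter

# Progressive sequence-length expansion: each RL stage raises the maximum
# generation length (these stage boundaries are hypothetical examples).
LENGTH_STAGES = [8192, 16384, 32768]

def repetition_penalty(token_ids, ngram=4, threshold=0.2):
    """Adaptive repetition penalty (illustrative): penalize a rollout in
    proportion to its fraction of repeated n-grams, but only once that
    fraction exceeds a tolerance threshold."""
    if len(token_ids) < ngram:
        return 0.0
    ngrams = [tuple(token_ids[i:i + ngram])
              for i in range(len(token_ids) - ngram + 1)]
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())  # surplus occurrences
    ratio = repeated / len(ngrams)
    return -max(0.0, ratio - threshold)  # 0.0 while repetition stays mild

def shaped_reward(correct, token_ids, max_len):
    """Verifiable reward shaped by the repetition penalty. Rollouts longer
    than the current stage's context limit get no credit (an assumption of
    this sketch)."""
    if len(token_ids) > max_len:
        return 0.0
    base = 1.0 if correct else 0.0
    return base + repetition_penalty(token_ids)
```

In this sketch, a correct, non-repetitive rollout within the current stage's length budget earns the full reward of 1.0, while heavily looping generations are pushed toward shorter, more token-efficient traces.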

📝 Abstract
Large language models have recently evolved from fluent text generation to advanced reasoning across diverse domains, giving rise to reasoning language models (RLMs). Among these domains, mathematical reasoning serves as a representative benchmark: it requires precise multi-step logic and abstract reasoning that generalize to other tasks. While closed-source RLMs such as OpenAI's o3 demonstrate impressive reasoning capabilities, their proprietary nature limits transparency and reproducibility. Although many open-source projects aim to close this gap, most lack sufficient openness, omitting critical resources such as datasets and detailed training configurations, which hinders reproducibility. To contribute toward greater transparency in RLM development, we introduce the MiroMind-M1 series, a set of fully open-source RLMs built on the Qwen-2.5 backbone that match or exceed the performance of existing open-source RLMs. Specifically, our models are trained in two stages: supervised fine-tuning (SFT) on a carefully curated corpus of 719K math-reasoning problems with verified chain-of-thought (CoT) trajectories, followed by reinforcement learning with verifiable rewards (RLVR) on 62K challenging and verifiable problems. To enhance the robustness and efficiency of the RLVR process, we introduce Context-Aware Multi-Stage Policy Optimization, an algorithm that integrates length-progressive training with an adaptive repetition penalty to encourage context-aware RL training. Our models achieve state-of-the-art or competitive performance and superior token efficiency among Qwen-2.5-based open-source 7B and 32B models on the AIME24, AIME25, and MATH benchmarks. To facilitate reproducibility, we release the complete stack: models (MiroMind-M1-SFT-7B, MiroMind-M1-RL-7B, MiroMind-M1-RL-32B), datasets (MiroMind-M1-SFT-719K, MiroMind-M1-RL-62K), and all training and evaluation configurations. We hope these resources will support further research and foster community advancement.
Problem

Research questions and friction points this paper is trying to address.

Building fully open-source reasoning language models for mathematical reasoning
Closing the transparency and reproducibility gaps of existing RLMs, which often withhold datasets and training configurations
Stabilizing RL training over long reasoning contexts while keeping token usage efficient
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training pipeline: SFT on 719K verified CoT problems, followed by RLVR on 62K high-difficulty problems
Context-Aware Multi-Stage Policy Optimization, combining length-progressive training with an adaptive repetition penalty
Full open-source release of model weights, datasets, and training/evaluation configurations
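The "verifiable" part of RLVR means each training problem has a checkable gold answer, so the reward is a programmatic match rather than a learned reward model. A minimal sketch of such a verifier is below; the `\boxed{...}` answer convention and the whitespace/case normalization are assumptions of this illustration, not details confirmed by the paper.

```python
import re

def extract_final_answer(text):
    """Pull the last \\boxed{...} expression from a chain-of-thought
    rollout (a common convention for math benchmarks, assumed here)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def verifiable_reward(rollout, gold):
    """Binary RLVR-style reward: 1.0 iff the extracted answer matches the
    verified gold answer after light normalization."""
    pred = extract_final_answer(rollout)
    if pred is None:
        return 0.0
    norm = lambda s: s.replace(" ", "").lower()
    return 1.0 if norm(pred) == norm(gold) else 0.0
```

A real math verifier would also handle equivalent forms (e.g. `0.5` vs `1/2`) via symbolic comparison, but the binary, rule-based structure is the point: it is what makes the 62K RL problems "verifiable".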