MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes

📅 2025-09-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work challenges the prevailing assumption that strong reasoning in language models requires ultra-large-scale training data, and asks whether sub-billion-parameter models can reason competitively under data-constrained regimes. We take a data-centric approach: using purpose-designed quality metrics, we filter and resample open-source datasets, retaining only ~2T tokens of high-quality text, and pretrain for 4.2T tokens on the resampled mix. With an established post-training procedure, the resulting MobileLLM-R1-950M scores 15.5 on AIME, far ahead of other models trained on fully open-source data (0.6 for OLMo-2-1.48B, 0.3 for SmolLM-2-1.7B). It also matches or surpasses Qwen3-0.6B on multiple reasoning benchmarks while pretraining on just 11.7% of the tokens in Qwen3's 36T-token corpus. The results indicate that carefully curated, comparatively small-scale data can elicit strong reasoning in compact language models, pointing toward efficient, low-cost reasoning model development.

📝 Abstract

The paradigm shift in large language models (LLMs) from instinctive responses to chain-of-thought (CoT) reasoning has fueled two prevailing assumptions: (1) reasoning capabilities only emerge in sufficiently large models, and (2) such capabilities require training on massive datasets. While the first assumption has already been challenged by recent sub-billion-parameter reasoning models such as Qwen3-0.6B and DeepSeek distilled variants, the second remains largely unquestioned. In this work, we revisit the necessity of scaling to extremely large corpora (>10T tokens) for reasoning emergence. By carefully curating and resampling open-source datasets that we identify as beneficial under our designed metrics, we demonstrate that strong reasoning abilities can emerge with far less data. Specifically, we show that only ~2T tokens of high-quality data are sufficient, and pretraining with 4.2T tokens on the dataset resampled from these ~2T tokens, followed by an established post-training procedure, enables the development of MobileLLM-R1, a series of sub-billion-parameter reasoning models that substantially outperform prior models trained on fully open-sourced data. For example, MobileLLM-R1-950M achieves an AIME score of 15.5, compared to just 0.6 for OLMo-2-1.48B and 0.3 for SmolLM-2-1.7B. Remarkably, despite being trained on only 11.7% of the tokens in Qwen3's proprietary 36T-token pretraining corpus, MobileLLM-R1-950M matches or surpasses Qwen3-0.6B across multiple reasoning benchmarks. To facilitate further research in this direction, we have released the complete training recipe, data sources, data mixing ratio, and model checkpoints, together with the key insights obtained throughout this study.
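The token figures in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above (4.2T pretraining tokens, ~2T curated tokens, Qwen3's 36T-token corpus). The "average passes" value is an assumption for illustration: it treats ~2T as exactly 2T and ignores that resampling is unlikely to be uniform across sources.

```python
TRILLION = 1e12

# Figures quoted in the abstract.
pretrain_tokens = 4.2 * TRILLION   # tokens seen during pretraining
curated_tokens = 2.0 * TRILLION    # "~2T" curated tokens, treated as exactly 2T (assumption)
qwen3_tokens = 36.0 * TRILLION     # Qwen3's pretraining corpus size

# Share of Qwen3's token budget used by MobileLLM-R1.
token_fraction = pretrain_tokens / qwen3_tokens

# Average number of times each curated token would be seen if
# resampling were uniform (a simplification, not the paper's recipe).
avg_passes = pretrain_tokens / curated_tokens

print(f"Fraction of Qwen3 tokens: {token_fraction:.1%}")  # 11.7%
print(f"Average passes over curated data: {avg_passes:.1f}")  # 2.1
```

The 11.7% figure is therefore a ratio of token counts, not of compute; the two differ when model sizes differ.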

Problem

Research questions and friction points this paper is trying to address.

Challenging the need for massive datasets to develop reasoning capabilities

Exploring sub-billion parameter language models with limited training data

Developing efficient reasoning models using carefully curated open-source data

Innovation

Methods, ideas, or system contributions that make the work stand out.

Curates ~2T tokens of high-quality open-source data using purpose-designed quality metrics

Resamples the curated data into a 4.2T-token pretraining run

Employs an established post-training procedure
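To make the resampling idea concrete, here is a minimal sketch of how a fixed token budget might be split across data sources by mixing weight, and how many passes over each source that implies. The function `resample_plan`, the source names, their sizes, and the weights are all hypothetical and chosen for illustration; they are not the paper's released data mix, and only the 4.2T budget and ~2T total come from the abstract.

```python
def resample_plan(source_tokens, mix_weights, budget):
    """Split a token budget across sources in proportion to mix_weights.

    Returns, per source, the tokens drawn and the implied number of
    passes (epochs) over that source's available tokens.
    """
    total_weight = sum(mix_weights.values())
    plan = {}
    for name, available in source_tokens.items():
        drawn = budget * mix_weights[name] / total_weight
        plan[name] = {"tokens": drawn, "epochs": drawn / available}
    return plan


# Hypothetical sources summing to ~2T available tokens.
source_tokens = {"web": 1.5e12, "code": 0.3e12, "math": 0.2e12}
# Hypothetical mixing weights that upweight code and math.
mix_weights = {"web": 0.5, "code": 0.25, "math": 0.25}

plan = resample_plan(source_tokens, mix_weights, budget=4.2e12)
for name, entry in plan.items():
    print(f"{name}: {entry['tokens'] / 1e12:.2f}T tokens, {entry['epochs']:.2f} epochs")
```

Under these made-up weights, the smaller code and math sources are repeated more often than the web source, which is the general effect of upweighting scarce, high-value data within a fixed budget.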
