SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

📅 2025-02-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the high computational cost of large language models (LLMs) and the difficulty of deploying them in resource-constrained environments, this work introduces SmolLM2, a small language model with 1.7 billion parameters. The model is trained on roughly 11 trillion tokens in a multi-stage process that mixes web text with specialized math, code, and instruction-following data. Where existing datasets proved too small or too low in quality, the authors introduce new ones: FineMath, Stack-Edu, and SmolTalk. Design decisions are informed by small-scale ablations and by a manual refinement process that updates the dataset mixing rates at each stage based on performance at the previous stage. The paper reports that SmolLM2 outperforms other recent small LMs, including Qwen2.5-1.5B and Llama3.2-1B. The model and all datasets prepared for the project are publicly released.

📝 Abstract

While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art "small" (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations and a manual refinement process that updates the dataset mixing rates at each stage based on the performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 and all of the datasets we prepared in the course of this project.
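
To give a sense of what "overtrain" means here, the following minimal sketch divides the token count by the parameter count, both taken from the abstract. The comparison baseline of about 20 tokens per parameter is an assumption: it is a commonly cited compute-optimal rule of thumb, not a figure from this paper.

```python
# Back-of-the-envelope check of the "overtraining" described in the abstract.
PARAMS = 1.7e9        # model size stated in the abstract
TRAIN_TOKENS = 11e12  # "~11 trillion tokens" stated in the abstract

# Assumption: ~20 tokens per parameter, a commonly cited compute-optimal
# rule of thumb. It is not a number reported by this paper.
RULE_OF_THUMB_TOKENS_PER_PARAM = 20


def tokens_per_param(tokens: float, params: float) -> float:
    """Return the number of training tokens seen per model parameter."""
    return tokens / params


ratio = tokens_per_param(TRAIN_TOKENS, PARAMS)
overtrain_factor = ratio / RULE_OF_THUMB_TOKENS_PER_PARAM

print(f"{ratio:,.0f} tokens per parameter")
print(f"about {overtrain_factor:,.0f}x the assumed rule of thumb")
```

Under these assumptions the model sees several thousand tokens per parameter, which is far beyond the rule of thumb. That is the trade the abstract describes: more training compute in exchange for a smaller model that is cheaper to deploy.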

Problem

Research questions and friction points this paper is trying to address.

Developing an efficient small language model

Optimizing the data-centric training process

Improving performance with specialized datasets

Innovation

Methods, ideas, or system contributions that make the work stand out.

Data-centric multi-stage training process

Integration of new specialized datasets (FineMath, Stack-Edu, SmolTalk)

Manual refinement of dataset mixing rates at each stage
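
The stage-wise mixing idea can be illustrated with a small sketch. Everything specific here is hypothetical: the source names (`web`, `code`, `math`), the weights in `STAGE_MIXTURES`, and the helper functions are illustrative assumptions and are not the values or code used for SmolLM2. The sketch only shows the general pattern of sampling training documents from a per-stage mixture and nudging one source's rate between stages.

```python
import random
from collections import Counter

# Hypothetical per-stage mixing rates; NOT the values used for SmolLM2.
STAGE_MIXTURES = [
    {"web": 0.90, "code": 0.10, "math": 0.00},
    {"web": 0.75, "code": 0.20, "math": 0.05},
    {"web": 0.60, "code": 0.25, "math": 0.15},
]


def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("mixture weights must have a positive sum")
    return {name: w / total for name, w in weights.items()}


def sample_sources(weights, n, rng):
    """Draw n source names with probability proportional to their weights."""
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=n)


def adjust_mixture(weights, source, delta):
    """Add delta to one source's weight (floored at 0), then renormalize.

    Stands in for the manual step of raising or lowering a dataset's rate
    after inspecting results from the previous stage.
    """
    updated = dict(weights)
    updated[source] = max(0.0, updated[source] + delta)
    return normalize(updated)


rng = random.Random(0)
for stage, mix in enumerate(STAGE_MIXTURES, start=1):
    counts = Counter(sample_sources(normalize(mix), 10_000, rng))
    print(f"stage {stage}: {dict(counts)}")
```

For example, `adjust_mixture(STAGE_MIXTURES[0], "math", 0.1)` returns a renormalized mixture with a nonzero math share, mimicking a decision to add math data after a stage in which math benchmarks lagged.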
