Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models

📅 2025-12-05
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
This work addresses the performance limitations of small language models (SLMs), specifically targeting the 3B-parameter scale. Methodologically, it introduces a fine-grained Warmup-Stable-Decay (FG-WSD) learning-rate scheduler for pre-training; an SFT data-refinement mechanism that combines deliberative generation refinement with chain-of-thought reconstruction; Dual Preference Distillation to improve the efficiency of knowledge transfer from a flagship teacher model; and a multi-stage reinforcement learning framework built on verifiable rewards and preference modeling. Together, these improve reasoning capability and human alignment. Empirically, the proposed Nanbeige4-3B model achieves state-of-the-art results across major benchmarks among 3B-scale models, matching or exceeding the performance of leading 7B models. The model checkpoints are open-sourced. The study establishes a systematic, reproducible technical pathway for developing high-performance, controllable SLMs, providing both methodological innovations and an empirical benchmark for efficient LLM research.
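For concreteness, the verification-based rewards used in the RL stage can be pictured as rule-based checks on the model's final answer. Below is a minimal sketch assuming a math-style task with a known reference answer; the function name, the \boxed{} answer convention, and the exact-match rule are illustrative assumptions, not the paper's actual verifier.

```python
import re

def verifiable_reward(completion: str, reference: str) -> float:
    """Rule-based reward: 1.0 if the model's final answer matches the
    reference, else 0.0. A hypothetical sketch; the page does not
    specify the paper's actual verification procedure."""
    # Assume answers are emitted as \boxed{...}, a common convention
    # for math benchmarks (an assumption, not stated in the report).
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    answer = match.group(1).strip()
    return 1.0 if answer == reference.strip() else 0.0
```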

📝 Abstract
We present Nanbeige4-3B, a family of small-scale but high-performing language models. Pretrained on 23T high-quality tokens and finetuned on over 30 million diverse instructions, Nanbeige4-3B extends the boundary of the scaling law for small language models. In pre-training, we design a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler, which progressively refines data mixtures across stages to boost model performance. In post-training, to improve the quality of the SFT data, we design a joint mechanism that integrates deliberative generation refinement and chain-of-thought reconstruction, yielding substantial gains on complex tasks. Following SFT, we employ our flagship reasoning model to distill Nanbeige4-3B through our proposed Dual Preference Distillation (DPD) method, which leads to further performance gains. Finally, we apply a multi-stage reinforcement learning phase, leveraging verifiable rewards and preference modeling to strengthen both reasoning and human alignment. Extensive evaluations show that Nanbeige4-3B not only significantly outperforms models of comparable parameter scale but also rivals much larger models across a wide range of benchmarks. The model checkpoints are available at https://huggingface.co/Nanbeige.
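To make the scheduler concrete: a Warmup-Stable-Decay schedule ramps the learning rate up, holds it flat, then decays it, and the fine-grained variant splits the stable phase into stages, each paired with a progressively refined data mixture. The sketch below is a minimal reading of that description; the warmup/decay fractions, peak learning rate, cosine decay shape, and stage mixture names are all hypothetical, since the report page does not specify them.

```python
import math

def fg_wsd_lr(step: int, total: int, peak_lr: float = 3e-4,
              warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    """Fine-grained Warmup-Stable-Decay sketch: linear warmup, a flat
    stable plateau, then cosine decay. All hyperparameters here are
    illustrative assumptions."""
    warmup_steps = int(total * warmup_frac)
    decay_start = int(total * (1 - decay_frac))
    if step < warmup_steps:                       # linear warmup
        return peak_lr * step / max(1, warmup_steps)
    if step < decay_start:                        # stable plateau
        return peak_lr
    # cosine decay from peak_lr toward zero over the final phase
    progress = (step - decay_start) / max(1, total - decay_start)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

# Hypothetical stable-phase stages (fraction of training, mixture name),
# each switching to a progressively refined data mixture; the paper's
# actual mixtures are not given on this page.
STAGES = [(0.0, "web-heavy"), (0.4, "quality-filtered"), (0.8, "reasoning-rich")]
```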
Problem

Research questions and friction points this paper is trying to address.

How can a 3B-parameter model family be trained to perform on par with much larger models?
How should learning-rate schedules and data mixtures be staged during pre-training to maximize small-model quality?
How can distillation and reinforcement learning strengthen a small model's reasoning and human alignment?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-Grained Warmup-Stable-Decay scheduler for pre-training
Joint deliberative generation refinement and chain-of-thought reconstruction for SFT data
Dual Preference Distillation and multi-stage RL for reasoning and alignment (see the sketch below)
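This page does not spell out how Dual Preference Distillation works. One plausible reading, sketched below purely as an assumption, is a DPO-style objective in which the flagship teacher supplies both the preference pairs and the reference log-probabilities; the loss form, function name, and beta value are guesses, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def dpd_loss(student_logp_chosen: torch.Tensor,
             student_logp_rejected: torch.Tensor,
             teacher_logp_chosen: torch.Tensor,
             teacher_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Hypothetical Dual Preference Distillation loss: a DPO-style
    objective where the teacher's log-probs act as the reference
    policy, pulling the student toward the teacher's preferences.
    This is an assumed formulation; the page does not define DPD."""
    student_margin = student_logp_chosen - student_logp_rejected
    teacher_margin = teacher_logp_chosen - teacher_logp_rejected
    # Encourage the student's chosen-vs-rejected margin to exceed the
    # teacher's margin (the implicit reference in DPO).
    return -F.logsigmoid(beta * (student_margin - teacher_margin)).mean()
```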
👥 Authors
Chen Yang · Nanbeige LLM Lab, Boss Zhipin
Guangyue Peng · Peking University
Jiaying Zhu · Nanbeige LLM Lab, Boss Zhipin
Ran Le · Nanbeige LLM Lab, Boss Zhipin
Ruixiang Feng · Nanbeige LLM Lab, Boss Zhipin
Tao Zhang · Nanbeige LLM Lab, Boss Zhipin
Wei Ruan · University of Georgia
Xiaoqi Liu · Nanbeige LLM Lab, Boss Zhipin
Xiaoxue Cheng · Renmin University of China
Xiyun Xu · Nanbeige LLM Lab, Boss Zhipin
Yang Song · Nanbeige LLM Lab, Boss Zhipin
Yanzipeng Gao · Nanbeige LLM Lab, Boss Zhipin
Yiming Jia · Nanbeige LLM Lab, Boss Zhipin
Yun Xing · School of Computer Science and Engineering, Nanyang Technological University
Yuntao Wen · Nanbeige LLM Lab, Boss Zhipin
Zekai Wang · Nanbeige LLM Lab, Boss Zhipin
Zhenwei An · Nanbeige LLM Lab, Boss Zhipin
Zhicong Sun · Nanbeige LLM Lab, Boss Zhipin
Zongchao Chen · Nanbeige LLM Lab, Boss Zhipin