🤖 AI Summary
This work addresses the performance limitations of small language models (SLMs), specifically targeting 3B-parameter models. Methodologically, it introduces a fine-grained warmup-stable-decay (FG-WSD) training scheduler that progressively refines data mixtures across pre-training stages; an SFT data-quality mechanism combining deliberative generation refinement with chain-of-thought reconstruction; Dual Preference Distillation (DPD) from a flagship reasoning model to improve knowledge transfer; and a multi-stage reinforcement learning framework leveraging verifiable rewards and preference modeling—collectively improving reasoning capability and human alignment. Empirically, the proposed Nanbeige4-3B model achieves state-of-the-art results across major benchmarks among 3B-scale models, matching or exceeding the performance of leading 7B models. The model checkpoints are open-sourced. This study establishes a systematic, reproducible technical pathway for developing high-performance, controllable SLMs, providing both methodological innovations and an empirical benchmark for efficient LLM research.
📝 Abstract
We present Nanbeige4-3B, a family of small-scale but high-performing language models. Pretrained on 23T high-quality tokens and fine-tuned on over 30 million diverse instructions, Nanbeige4-3B extends the boundary of the scaling law for small language models. In pre-training, we design a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler, which progressively refines data mixtures across stages to boost model performance. In post-training, to improve the quality of the SFT data, we design a joint mechanism that integrates deliberative generation refinement and chain-of-thought reconstruction, yielding substantial gains on complex tasks. Following SFT, we employ our flagship reasoning model to distill Nanbeige4-3B through our proposed Dual Preference Distillation (DPD) method, which leads to further performance gains. Finally, we apply a multi-stage reinforcement learning phase, leveraging verifiable rewards and preference modeling to strengthen both reasoning and human alignment. Extensive evaluations show that Nanbeige4-3B not only significantly outperforms models of comparable parameter scale but also rivals much larger models across a wide range of benchmarks. The model checkpoints are available at https://huggingface.co/Nanbeige.
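For intuition, the base warmup-stable-decay schedule underlying FG-WSD can be sketched as follows. This is a minimal illustration of the generic WSD learning-rate shape (linear warmup, constant plateau, linear decay); all hyperparameter values and the function name `wsd_lr` are illustrative assumptions, not the paper's actual configuration, and the paper's fine-grained variant additionally changes data mixtures across stages, which is not shown here.

```python
def wsd_lr(step, total_steps, peak_lr=3e-4, min_lr=3e-5,
           warmup_frac=0.05, decay_frac=0.2):
    """Generic Warmup-Stable-Decay learning-rate schedule (illustrative values).

    Phase 1: linear warmup from 0 to peak_lr.
    Phase 2: constant plateau at peak_lr.
    Phase 3: linear decay from peak_lr down to min_lr.
    """
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps

    if step < warmup_steps:
        # linear warmup
        return peak_lr * step / max(1, warmup_steps)
    if step < stable_end:
        # stable plateau
        return peak_lr
    # linear decay over the final window
    frac = (step - stable_end) / max(1, decay_steps)
    return peak_lr + (min_lr - peak_lr) * frac
```

A practical appeal of WSD-style schedules is that the long constant plateau lets one branch off decay runs at multiple points from a single stable trunk, which pairs naturally with stage-wise data-mixture changes.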