Nanbeige4-3B Technical Report: Exploring the Frontier of Small Language Models

📅 2025-12-05
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
This work addresses the performance limitations of small language models (SLMs), specifically targeting the 3B-parameter scale. Methodologically, it introduces a fine-grained Warmup-Stable-Decay (FG-WSD) learning-rate scheduler for pre-training; an SFT data-refinement mechanism that combines deliberative generation refinement with chain-of-thought reconstruction; Dual Preference Distillation to improve the efficiency of knowledge transfer from a flagship teacher model; and a multi-stage reinforcement learning framework built on verifiable rewards and preference modeling. Together, these improve reasoning capability and human alignment. Empirically, the proposed Nanbeige4-3B model achieves state-of-the-art results across major benchmarks among 3B-scale models, matching or exceeding the performance of leading 7B models. The model checkpoints are open-sourced. The study establishes a systematic, reproducible technical pathway for developing high-performance, controllable SLMs, providing both methodological innovations and an empirical benchmark for efficient LLM research.
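For concreteness, the verification-based rewards used in the RL stage can be pictured as rule-based checks on the model's final answer. Below is a minimal sketch assuming a math-style task with a known reference answer; the function name, the \boxed{} answer convention, and the exact-match rule are illustrative assumptions, not the paper's actual verifier.

```python
import re

def verifiable_reward(completion: str, reference: str) -> float:
    """Rule-based reward: 1.0 if the model's final answer matches the
    reference, else 0.0. A hypothetical sketch; the page does not
    specify the paper's actual verification procedure."""
    # Assume answers are emitted as \boxed{...}, a common convention
    # for math benchmarks (an assumption, not stated in the report).
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    answer = match.group(1).strip()
    return 1.0 if answer == reference.strip() else 0.0
```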

📝 Abstract
We present Nanbeige4-3B, a family of small-scale but high-performing language models. Pretrained on 23T high-quality tokens and finetuned on over 30 million diverse instructions, Nanbeige4-3B extends the boundary of the scaling law for small language models. In pre-training, we design a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler, which progressively refines data mixtures across stages to boost model performance. In post-training, to improve the quality of the SFT data, we design a joint mechanism that integrates deliberative generation refinement and chain-of-thought reconstruction, yielding substantial gains on complex tasks. Following SFT, we employ our flagship reasoning model to distill Nanbeige4-3B through our proposed Dual Preference Distillation (DPD) method, which leads to further performance gains. Finally, we apply a multi-stage reinforcement learning phase, leveraging verifiable rewards and preference modeling to strengthen both reasoning and human alignment. Extensive evaluations show that Nanbeige4-3B not only significantly outperforms models of comparable parameter scale but also rivals much larger models across a wide range of benchmarks. The model checkpoints are available at https://huggingface.co/Nanbeige.
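To make the scheduler concrete: a Warmup-Stable-Decay schedule ramps the learning rate up, holds it flat, then decays it, and the fine-grained variant splits the stable phase into stages, each paired with a progressively refined data mixture. The sketch below is a minimal reading of that description; the warmup/decay fractions, peak learning rate, cosine decay shape, and stage mixture names are all hypothetical, since the report page does not specify them.

```python
import math

def fg_wsd_lr(step: int, total: int, peak_lr: float = 3e-4,
              warmup_frac: float = 0.01, decay_frac: float = 0.1) -> float:
    """Fine-grained Warmup-Stable-Decay sketch: linear warmup, a flat
    stable plateau, then cosine decay. All hyperparameters here are
    illustrative assumptions."""
    warmup_steps = int(total * warmup_frac)
    decay_start = int(total * (1 - decay_frac))
    if step < warmup_steps:                       # linear warmup
        return peak_lr * step / max(1, warmup_steps)
    if step < decay_start:                        # stable plateau
        return peak_lr
    # cosine decay from peak_lr toward zero over the final phase
    progress = (step - decay_start) / max(1, total - decay_start)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))

# Hypothetical stable-phase stages (fraction of training, mixture name),
# each switching to a progressively refined data mixture; the paper's
# actual mixtures are not given on this page.
STAGES = [(0.0, "web-heavy"), (0.4, "quality-filtered"), (0.8, "reasoning-rich")]
```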
Problem

Research questions and friction points this paper is trying to address.

How can a 3B-parameter model family be trained to perform on par with much larger models?
How should learning-rate schedules and data mixtures be staged during pre-training to maximize small-model quality?
How can distillation and reinforcement learning strengthen a small model's reasoning and human alignment?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fine-Grained Warmup-Stable-Decay scheduler for pre-training
Joint deliberative generation refinement and chain-of-thought reconstruction for SFT data
Dual Preference Distillation and multi-stage RL for reasoning and alignment (see the sketch below)
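This page does not spell out how Dual Preference Distillation works. One plausible reading, sketched below purely as an assumption, is a DPO-style objective in which the flagship teacher supplies both the preference pairs and the reference log-probabilities; the loss form, function name, and beta value are guesses, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def dpd_loss(student_logp_chosen: torch.Tensor,
             student_logp_rejected: torch.Tensor,
             teacher_logp_chosen: torch.Tensor,
             teacher_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Hypothetical Dual Preference Distillation loss: a DPO-style
    objective where the teacher's log-probs act as the reference
    policy, pulling the student toward the teacher's preferences.
    This is an assumed formulation; the page does not define DPD."""
    student_margin = student_logp_chosen - student_logp_rejected
    teacher_margin = teacher_logp_chosen - teacher_logp_rejected
    # Encourage the student's chosen-vs-rejected margin to exceed the
    # teacher's margin (the implicit reference in DPO).
    return -F.logsigmoid(beta * (student_margin - teacher_margin)).mean()
```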
👥 Authors
Chen Yang · Nanbeige LLM Lab, Boss Zhipin
Guangyue Peng · Peking University
Jiaying Zhu · Nanbeige LLM Lab, Boss Zhipin
Ran Le · Nanbeige LLM Lab, Boss Zhipin
Ruixiang Feng · Nanbeige LLM Lab, Boss Zhipin
Tao Zhang · Nanbeige LLM Lab, Boss Zhipin
Wei Ruan · University of Georgia
Xiaoqi Liu · Nanbeige LLM Lab, Boss Zhipin
Xiaoxue Cheng · Renmin University of China
Xiyun Xu · Nanbeige LLM Lab, Boss Zhipin
Yang Song · Nanbeige LLM Lab, Boss Zhipin
Yanzipeng Gao · Nanbeige LLM Lab, Boss Zhipin
Yiming Jia · Nanbeige LLM Lab, Boss Zhipin
Yun Xing · School of Computer Science and Engineering, Nanyang Technological University
Yuntao Wen · Nanbeige LLM Lab, Boss Zhipin
Zekai Wang · Nanbeige LLM Lab, Boss Zhipin
Zhenwei An · Nanbeige LLM Lab, Boss Zhipin
Zhicong Sun · Nanbeige LLM Lab, Boss Zhipin
Zongchao Chen · Nanbeige LLM Lab, Boss Zhipin