dots.llm1 Technical Report

📅 2025-06-06
🤖 AI Summary
To address the prohibitively high computational cost of scaling large language models, this paper introduces an efficient sparse Mixture-of-Experts (MoE) language model with 142B total parameters, only 14B of which are activated per token, substantially reducing training and inference overhead. Methodologically, it employs a dynamic-routing MoE architecture (sketched below), pretrains exclusively on 11.2T high-quality, human-curated tokens without any synthetic data, leverages a customized high-throughput data pipeline, and applies supervised fine-tuning and alignment in post-training. Key contributions include: (i) the open release of intermediate checkpoints at every trillion tokens of pretraining, enabling empirical study of large-model learning dynamics; (ii) a synthetic-data-free pretraining pipeline; and (iii) post-training performance competitive with the dense Qwen2.5-72B model at significantly lower inference cost. All code and checkpoints are publicly released.
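The core mechanism behind the summary's "14B activated out of 142B" claim is sparse expert routing: a learned router scores all experts for each token and only the top-k are evaluated, so per-token compute scales with k rather than with the total expert count. Below is a minimal PyTorch sketch of such a top-k MoE layer; the class name, expert structure, and the expert count and k values are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal top-k MoE layer (illustrative sketch; not dots.llm1's actual config).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=2048, n_experts=16, k=2):
        super().__init__()
        self.k = k
        # Router produces one score per expert for each token.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (n_tokens, d_model). Score experts, keep only the top-k per token.
        scores = self.router(x)                      # (n_tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)     # normalize over selected experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in topk_idx[:, slot].unique():
                mask = topk_idx[:, slot] == e
                # Only tokens routed to expert e pay for its computation.
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

tokens = torch.randn(8, 1024)
print(TopKMoE()(tokens).shape)  # torch.Size([8, 1024])
```

In dots.llm1 the same principle is applied at 142B-parameter scale, with routing keeping the active set near 14B parameters per token, which is what makes its training and inference cheaper than a dense model of comparable quality.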

📝 Abstract
Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints at every one trillion tokens, providing valuable insights into the learning dynamics of large language models.
Problem

Research questions and friction points this paper is trying to address.

Efficiently scaling language models with Mixture of Experts (MoE).
Reducing training and inference costs while maintaining performance.
Providing insights into large language model learning dynamics.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Mixture-of-Experts model activating 14B of 142B total parameters per token
Efficient data processing pipeline yielding 11.2T high-quality pretraining tokens
Open-source intermediate training checkpoints at every trillion tokens (a loading sketch follows below)
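Since the intermediate checkpoints are the headline research artifact, a typical way to study learning dynamics would be to load successive checkpoints and compare them on a fixed probe set. The sketch below uses the standard Hugging Face transformers API; the repo id and revision names are assumptions for illustration, not identifiers confirmed by the report.

```python
# Hypothetical sketch: iterate over successive pretraining checkpoints.
# The repo id and revision names below are ASSUMPTIONS, not verified releases.
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO = "rednote-hilab/dots.llm1.base"   # assumed Hugging Face repo id
REVISIONS = ["1T", "2T", "3T"]          # assumed tags, one per trillion tokens

tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

for rev in REVISIONS:
    # Each revision would correspond to one intermediate checkpoint.
    model = AutoModelForCausalLM.from_pretrained(
        REPO, revision=rev, torch_dtype="auto", trust_remote_code=True
    )
    out = model.generate(**inputs, max_new_tokens=8)
    print(rev, tokenizer.decode(out[0], skip_special_tokens=True))
```

Comparing outputs (or held-out perplexity) across revisions is one concrete way the released checkpoints could support the learning-dynamics studies the authors describe.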
👥 Authors
Bi Huo, Bin Tu, Cheng Qin, Da Zheng (Amazon), Debing Zhang (Xiaohongshu), Dongjie Zhang, En Li, Fu Guo, Jian Yao (Wuhan University), Jie Lou (Xiaohongshu), Junfeng Tian, Li Hu, Ran Zhu, Shengdong Chen, Shuo Liu, Su Guang, Weijun Zhang, Xiaoming Shi, Xinxin Peng, Xing Wu, Yawen Liu, Yuqiu Ji, Ze Wen, Zhenhai Liu, Zichao Li, Zilong Liao