🤖 AI Summary
To address the prohibitively high computational cost of scaling large language models, this paper introduces an efficient sparse Mixture-of-Experts (MoE) language model with 142B total parameters, of which only 14B are activated per token, substantially reducing training and inference overhead. Methodologically, it employs a dynamic-routing MoE architecture, pretrains on 11.2T tokens of high-quality, human-curated data with no synthetic data, leverages a customized high-throughput data pipeline, and applies supervised fine-tuning and alignment techniques after pretraining. Key contributions include: (i) the first open release of intermediate checkpoints saved at every one trillion training tokens, enabling empirical study of large-model learning dynamics; (ii) a training pipeline that is entirely free of synthetic data; and (iii) post-training performance competitive with the dense Qwen2.5-72B model at significantly lower inference cost. All code and checkpoints are publicly released.
📝 Abstract
Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints saved at every one trillion tokens, providing valuable insights into the learning dynamics of large language models.
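The core idea described above, activating only a few experts per token via a learned gate, can be sketched with a toy top-k router. This is a minimal illustration of sparse MoE routing in general; the expert count, top-k value, and function names here are hypothetical and are not dots.llm1's actual configuration or code.

```python
import math

# Illustrative sizes only -- not dots.llm1's real hyperparameters.
NUM_EXPERTS = 8   # total experts (analogous to the 142B total parameters)
TOP_K = 2         # experts activated per token (analogous to the 14B active)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_logits):
    """Pick the TOP_K highest-probability experts for one token and
    renormalize their gate weights so they sum to 1."""
    probs = softmax(gate_logits)
    topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:TOP_K]
    mass = sum(probs[i] for i in topk)
    return [(i, probs[i] / mass) for i in topk]

def moe_forward(x, experts, gate_logits):
    """Run only the selected experts and combine their outputs by gate weight;
    the other experts' parameters are never touched for this token."""
    return sum(w * experts[i](x) for i, w in route(gate_logits))

# Toy experts: each scalar function stands in for a full expert FFN.
experts = [lambda x, k=k: (k + 1) * x for k in range(NUM_EXPERTS)]
```

Because `moe_forward` evaluates only `TOP_K` of the `NUM_EXPERTS` experts, compute per token scales with the activated parameters rather than the total, which is the efficiency property the abstract refers to.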