🤖 AI Summary
To address the prohibitively high computational cost of scaling large language models, this paper introduces an efficient sparse Mixture-of-Experts (MoE) language model with 142B total parameters, of which only 14B are activated per token, substantially reducing training and inference overhead. Methodologically, it employs a dynamic-routing MoE architecture, pretrains on 11.2T tokens of high-quality, human-curated data with no synthetic data, leverages a customized high-throughput data pipeline, and applies supervised fine-tuning and alignment techniques after pretraining. Key contributions include: (i) the first open release of intermediate checkpoints saved at every one trillion training tokens, enabling empirical study of large-model learning dynamics; (ii) a training pipeline that is entirely free of synthetic data; and (iii) post-training performance competitive with the dense Qwen2.5-72B model at significantly lower inference cost. All code and checkpoints are publicly released.
📝 Abstract
Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints saved at every one trillion tokens, providing valuable insights into the learning dynamics of large language models.
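The core idea described above, activating only a few experts per token via a learned gate, can be sketched with a toy top-k router. This is a minimal illustration of sparse MoE routing in general; the expert count, top-k value, and function names here are hypothetical and are not dots.llm1's actual configuration or code.

```python
import math

# Illustrative sizes only -- not dots.llm1's real hyperparameters.
NUM_EXPERTS = 8   # total experts (analogous to the 142B total parameters)
TOP_K = 2         # experts activated per token (analogous to the 14B active)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(gate_logits):
    """Pick the TOP_K highest-probability experts for one token and
    renormalize their gate weights so they sum to 1."""
    probs = softmax(gate_logits)
    topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:TOP_K]
    mass = sum(probs[i] for i in topk)
    return [(i, probs[i] / mass) for i in topk]

def moe_forward(x, experts, gate_logits):
    """Run only the selected experts and combine their outputs by gate weight;
    the other experts' parameters are never touched for this token."""
    return sum(w * experts[i](x) for i, w in route(gate_logits))

# Toy experts: each scalar function stands in for a full expert FFN.
experts = [lambda x, k=k: (k + 1) * x for k in range(NUM_EXPERTS)]
```

Because `moe_forward` evaluates only `TOP_K` of the `NUM_EXPERTS` experts, compute per token scales with the activated parameters rather than the total, which is the efficiency property the abstract refers to.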