Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs

📅 2025-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the challenge of efficiently training sparse large models with close to a trillion parameters on Ascend NPUs, this work proposes a set of system-level optimizations. First, it uses a hardware-aware, lightweight MoE simulation to compare model hyperparameters without repeatedly running expensive experiments on real hardware. Second, it combines Expert Parallelism with NPU-specific communication scheduling to reduce inter-device communication and synchronization overhead. Third, it improves memory efficiency via activation/parameter memory reuse, quantization-based compression, and on-device memory layout optimization. With this recipe, the authors train the 718-billion-parameter Pangu Ultra MoE model on 6K Ascend NPUs and reach a Model FLOPs Utilization (MFU) of 30.0%, with performance comparable to that of DeepSeek R1, and show that the Ascend stack can support all training stages of a state-of-the-art sparse large model.

📝 Abstract

Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware. To select model configurations suitable for Ascend NPUs without repeatedly running the expensive experiments, we leverage simulation to compare the trade-off of various model hyperparameters. This study led to Pangu Ultra MoE, a sparse LLM with 718 billion parameters, and we conducted experiments on the model to verify the simulation results. On the system side, we dig into Expert Parallelism to optimize the communication between NPU devices to reduce the synchronization overhead. We also optimize the memory efficiency within the devices to further reduce the parameter and activation management overhead. In the end, we achieve an MFU of 30.0% when training Pangu Ultra MoE, with performance comparable to that of DeepSeek R1, on 6K Ascend NPUs, and demonstrate that the Ascend system is capable of harnessing all the training stages of the state-of-the-art language models. Extensive experiments indicate that our recipe can lead to efficient training of large-scale sparse language models with MoE. We also study the behaviors of such models for future reference.
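The abstract's headline number is an MFU of 30.0%. As a rough sketch of what that metric measures, MFU is the ratio of the FLOPs a training run actually spends on the model to the peak FLOPs the hardware could deliver in the same time. The code below uses the common approximation of about 6 FLOPs per active parameter per token (forward plus backward pass, ignoring attention FLOPs); this approximation, the function names, and every number in the example are assumptions chosen for illustration, not figures from the paper.

```python
def model_flops_utilization(active_params, tokens_per_second,
                            num_devices, peak_flops_per_device):
    """Approximate MFU for a training run.

    Assumes ~6 FLOPs per active parameter per token (forward + backward),
    a common rule of thumb that ignores attention FLOPs. For an MoE model,
    `active_params` is the number of parameters used per token, not the
    total parameter count.
    """
    achieved_flops = 6.0 * active_params * tokens_per_second
    peak_flops = num_devices * peak_flops_per_device
    return achieved_flops / peak_flops


def tokens_per_second_for_mfu(target_mfu, active_params,
                              num_devices, peak_flops_per_device):
    """Invert the formula: throughput needed to reach a target MFU."""
    peak_flops = num_devices * peak_flops_per_device
    return target_mfu * peak_flops / (6.0 * active_params)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
mfu = model_flops_utilization(
    active_params=1e9,          # 1B parameters active per token (made up)
    tokens_per_second=1e5,      # cluster-wide throughput (made up)
    num_devices=8,              # device count (made up)
    peak_flops_per_device=1e14, # peak FLOP/s per device (made up)
)
needed = tokens_per_second_for_mfu(0.30, 1e9, 8, 1e14)
```

Because only the active parameters enter the numerator, a sparse model with a very large total parameter count can still report a modest MFU if routing, communication, or memory management keeps the devices waiting.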

Problem

Research questions and friction points this paper is trying to address.

Train large MoE models efficiently on Ascend NPUs

Make better use of computing resources under dynamic sparse model structures

Reduce communication overhead between NPUs and memory overhead within each NPU

Innovation

Methods, ideas, or system contributions that make the work stand out.

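Simulation-based comparison of MoE hyperparameters to select configurations suited to Ascend NPUs

Expert Parallelism optimizations that reduce communication and synchronization overhead between NPUs

On-device memory optimizations that reduce parameter and activation management overhead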
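To make the Expert Parallelism item above more concrete, here is a minimal sketch of why it creates communication: each token is routed to its top-k experts, and when experts are sharded across devices, every (token, expert) pair becomes a unit of work that may have to be sent to another device. This is a generic illustration of top-k routing, not the paper's scheduling scheme; the function names, the contiguous expert-to-device layout, and the scores are all hypothetical.

```python
from collections import Counter


def route_top_k(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return ranked[:k]


def dispatch_counts(token_scores, k, experts_per_device):
    """Count how many (token, expert) assignments land on each device.

    Assumes experts are placed contiguously: expert e lives on device
    e // experts_per_device. Uneven counts mean uneven compute load and
    uneven traffic in the token dispatch step.
    """
    per_device = Counter()
    for scores in token_scores:
        for expert in route_top_k(scores, k):
            per_device[expert // experts_per_device] += 1
    return per_device


# Two tokens, four experts, two experts per device (all values made up).
token_scores = [
    [0.1, 0.9, 0.3, 0.7],  # token 0 prefers experts 1 and 3
    [0.8, 0.2, 0.6, 0.4],  # token 1 prefers experts 0 and 2
]
counts = dispatch_counts(token_scores, k=2, experts_per_device=2)
```

In this toy case the load is balanced, with two assignments per device. Real routers produce skewed, input-dependent loads, which is the "dynamic sparse model structure" that makes communication and synchronization between devices hard to schedule.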
