Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs

📅 2025-05-07
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of efficiently training near-trillion-parameter sparse models on Ascend NPUs, this work proposes a system-level optimization framework. First, it introduces a hardware-aware, lightweight MoE simulation framework that accelerates hyperparameter selection without extensive physical experimentation. Second, it designs a synergistic mechanism that integrates expert parallelism with NPU-customized communication scheduling to minimize inter-chip communication overhead. Third, it improves memory efficiency through activation/parameter memory reuse, quantization-based compression, and on-device memory layout optimization. The framework successfully trains the 718-billion-parameter Pangu Ultra MoE model across 6,000 Ascend NPUs, reaching a Model FLOPs Utilization (MFU) of 30.0% with performance comparable to DeepSeek R1, and demonstrates for the first time the Ascend platform's full-stack capability to support state-of-the-art sparse large-model training.
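To make the first contribution concrete: simulation-based configuration search scores candidate hyperparameters with an analytic cost model instead of full training runs. The sketch below is a minimal illustration of that idea, not the paper's simulator; the `MoEConfig` fields and the `peak_flops` and `link_bw` constants are illustrative placeholders rather than Ascend specifications.

```python
# Minimal roofline-style cost model for comparing MoE hyperparameters
# (illustrative sketch only; constants are placeholders, not Ascend specs).
from dataclasses import dataclass

@dataclass
class MoEConfig:
    hidden: int       # model hidden size
    ffn_hidden: int   # expert FFN intermediate size
    top_k: int        # experts activated per token
    tokens: int       # tokens per device per micro-batch
    ep_size: int      # expert-parallel group size

def estimate_layer_time(cfg: MoEConfig,
                        peak_flops: float = 3e14,   # assumed device peak FLOP/s
                        link_bw: float = 5e10):     # assumed inter-chip bandwidth, bytes/s
    """Return rough (compute_s, comm_s) estimates for one MoE layer's forward pass."""
    # Expert GEMMs: each token visits top_k experts, two projections, 2 FLOPs per MAC.
    gemm_flops = 2 * 2 * cfg.tokens * cfg.top_k * cfg.hidden * cfg.ffn_hidden
    compute_s = gemm_flops / peak_flops
    # All-to-all dispatch + combine: each routed copy of a token (bf16 activations)
    # crosses the expert-parallel group twice; (ep_size - 1) / ep_size of it leaves the device.
    bytes_moved = 2 * cfg.tokens * cfg.top_k * cfg.hidden * 2
    comm_s = bytes_moved * (cfg.ep_size - 1) / cfg.ep_size / link_bw
    return compute_s, comm_s

if __name__ == "__main__":
    for top_k in (2, 4, 8):
        cfg = MoEConfig(hidden=7680, ffn_hidden=2048, top_k=top_k,
                        tokens=4096, ep_size=32)
        c, m = estimate_layer_time(cfg)
        print(f"top_k={top_k}: compute {c*1e3:.2f} ms, comm {m*1e3:.2f} ms")
```

A real simulator would also model attention, pipeline and tensor parallelism, and overlap, but even this level of modeling is enough to rank configurations by whether they are compute- or communication-bound.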

📝 Abstract
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of the most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware. To select model configurations suitable for Ascend NPUs without repeatedly running expensive experiments, we leverage simulation to compare the trade-offs of various model hyperparameters. This study led to Pangu Ultra MoE, a sparse LLM with 718 billion parameters, and we conducted experiments on the model to verify the simulation results. On the system side, we dig into Expert Parallelism to optimize the communication between NPU devices and reduce the synchronization overhead. We also optimize the memory efficiency within the devices to further reduce the parameter and activation management overhead. In the end, we achieve an MFU of 30.0% when training Pangu Ultra MoE, with performance comparable to that of DeepSeek R1, on 6K Ascend NPUs, and demonstrate that the Ascend system is capable of harnessing all the training stages of the state-of-the-art language models. Extensive experiments indicate that our recipe can lead to efficient training of large-scale sparse language models with MoE. We also study the behaviors of such models for future reference.
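For context on the headline number: MFU is conventionally the model FLOPs actually spent per second divided by the aggregate peak FLOPs of the cluster. The snippet below shows that arithmetic; only the 6,000-NPU scale comes from the paper, while the throughput, activated-parameter count, and per-device peak are illustrative placeholders.

```python
# Hedged illustration of the MFU calculation (placeholder numbers, not the paper's).
def mfu(tokens_per_second: float, flops_per_token: float,
        num_devices: int, peak_flops_per_device: float) -> float:
    """Model FLOPs Utilization = useful model FLOPs per second / aggregate peak FLOPs."""
    return (tokens_per_second * flops_per_token) / (num_devices * peak_flops_per_device)

# For a sparse MoE model, flops_per_token counts only the activated parameters,
# roughly 6 FLOPs per activated parameter for a combined forward + backward pass.
active_params = 40e9                       # assumed activated parameters per token
flops_per_token = 6 * active_params
print(f"MFU ~ {mfu(3.0e6, flops_per_token, 6000, 4e14):.1%}")
```
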
Problem

Research questions and friction points this paper is trying to address.

Efficiently train near-trillion-parameter MoE models on Ascend NPUs
Make full use of computing resources under dynamic sparse model structures
Reduce inter-chip communication and on-device memory overhead
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulation-based selection of MoE model hyperparameters
Expert parallelism with optimized inter-NPU communication scheduling (see the dispatch sketch after this list)
Memory-efficiency optimizations for parameter and activation management
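As a rough picture of the communication that expert parallelism creates, the sketch below groups routed tokens by the expert-parallel rank hosting their experts; those per-rank send counts are what the all-to-all dispatch (and the scheduling the paper optimizes) has to move. This is an assumption-based illustration, not the paper's implementation, and every configuration number is a placeholder.

```python
# Sketch: build the per-rank send counts for MoE all-to-all dispatch under
# expert parallelism (illustrative only; sizes are placeholders).
import numpy as np

def dispatch_plan(expert_ids: np.ndarray, n_experts: int, ep_size: int) -> np.ndarray:
    """expert_ids: (tokens, top_k) routed expert indices produced on one device.
    Returns the number of token copies this device sends to each EP rank."""
    experts_per_rank = n_experts // ep_size
    dest_rank = expert_ids // experts_per_rank   # EP rank that hosts each routed expert
    return np.bincount(dest_rank.ravel(), minlength=ep_size)

rng = np.random.default_rng(0)
tokens, top_k, n_experts, ep_size = 4096, 8, 256, 32
routed = rng.integers(0, n_experts, size=(tokens, top_k))
print("token copies sent to each EP rank:", dispatch_plan(routed, n_experts, ep_size))
```

Balanced routing keeps these counts roughly equal across ranks; skew toward a few ranks is exactly the kind of load imbalance that makes communication scheduling matter.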

👥 Authors

Yehui Tang (Shanghai Jiao Tong University) · Machine Learning, Quantum AI & AI4Science
Yichun Yin (Noah's Ark Lab, Huawei) · LLM
Yaoyuan Wang (Pangu Team, Huawei)
Hang Zhou (Pangu Team, Huawei)
Yu Pan (Pangu Team, Huawei)
Wei Guo (Pangu Team, Huawei)
Ziyang Zhang (Pangu Team, Huawei)
Miao Rang (Huawei Technologies Co., Ltd.) · Computer Vision
Fangcheng Liu (Huawei Noah's Ark Lab, Peking University) · LLMs, Generative AI, Adversarials
Naifu Zhang (Pangu Team, Huawei)
Binghan Li (Pangu Team, Huawei)
Yonghan Dong (Pangu Team, Huawei)
Xiaojun Meng (Noah's Ark Lab, Huawei; Ph.D. @ National University of Singapore, Bachelor @ Tsinghua University) · Big Model, NLP, Multimodal, HCI
Yasheng Wang (Tencent) · Natural Language Processing
Dong Li (Pangu Team, Huawei)
Yin Li (Pangu Team, Huawei)
Dandan Tu (Pangu Team, Huawei)
Can Chen (Pangu Team, Huawei)
Youliang Yan (Huawei) · Computer Vision
Fisher Yu (Pangu Team, Huawei)
Ruiming Tang (Pangu Team, Huawei)
Yunhe Wang (Noah's Ark Lab, Huawei Technologies) · Deep Learning, Language Model, Machine Learning, Computer Vision
Botian Huang (Pangu Team, Huawei)
Bo Wang (Pangu Team, Huawei)
Boxiao Liu (Pangu Team, Huawei)
Changzheng Zhang (Pangu Team, Huawei)
Da Kuang (Pangu Team, Huawei)
Fei Liu (Pangu Team, Huawei)
Gang Huang (Pangu Team, Huawei)
Jiansheng Wei (Pangu Team, Huawei)
Jiarui Qin (Tencent) · Large Language Model, Recommender Systems, Information Retrieval
Jie Ran (Pangu Team, Huawei)
Jinpeng Li (Pangu Team, Huawei)
Jun Zhao (Pangu Team, Huawei)
Liang Dai (Pangu Team, Huawei)
Lin Li (Pangu Team, Huawei)
Liqun Deng (Pangu Team, Huawei)
Peifeng Qin (Pangu Team, Huawei)
Peng Zeng (Pangu Team, Huawei)
Qiang Gu (Pangu Team, Huawei)
Shaohua Tang (Pangu Team, Huawei)
Shengjun Cheng (Pangu Team, Huawei)
Tao Gao (Pangu Team, Huawei)
Tao Yu (Pangu Team, Huawei)
Tianshu Li (Pangu Team, Huawei)
Tianyu Bi (Pangu Team, Huawei)
Wei He (Pangu Team, Huawei)
Weikai Mao (Pangu Team, Huawei)
Wenyong Huang (Pangu Team, Huawei)
Wulong Liu (Unknown affiliation) · Reinforcement Learning, Autonomous Driving, Robotics, AI Infra, EDA
Xiabing Li (Pangu Team, Huawei)
Xianzhi Yu (Unknown affiliation) · AI, HPC
Xueyu Wu (The University of Hong Kong) · Distributed ML Systems, Federated Learning
Xu He (Pangu Team, Huawei)
Yangkai Du (Pangu Team, Huawei)
Yan Xu (Pangu Team, Huawei)
Ye Tian (Pangu Team, Huawei)
Yimeng Wu (Huawei Noah's Ark Lab) · Large Language Models
Yongbing Huang (Pangu Team, Huawei)
Yong Tian (Pangu Team, Huawei)
Yong Zhu (Pangu Team, Huawei)
Yue Li (Pangu Team, Huawei)
Yufei Wang (Pangu Team, Huawei)
Yuhang Gai (Pangu Team, Huawei)
Yujun Li (Northwestern Polytechnical University) · composite preforming, composite mechanics, resin flow, composite curing
Yu Luo (Pangu Team, Huawei)
Yunsheng Ni (Pangu Team, Huawei)
Yusen Sun (Pangu Team, Huawei)
Zelin Chen (Pangu Team, Huawei)
Zhe Liu (Pangu Team, Huawei)
Zhicheng Liu (Pangu Team, Huawei)
Zhipeng Tu (Pangu Team, Huawei)
Zilin Ding (Pangu Team, Huawei)
Zongyuan Zhan (Pangu Team, Huawei)