Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs

📅 2025-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address training instability and low system efficiency in ultra-large dense large language models (LLMs), this paper proposes depth-scaled sandwich normalization, a technique that suppresses loss spikes when training very deep models. Using a cluster of 8,192 Ascend NPUs, the authors build Pangu Ultra, a 135-billion-parameter dense Transformer, and show that hundred-billion-parameter models can be trained efficiently and at scale on Ascend hardware. Pretrained on 13.2 trillion high-quality tokens and further improved with reinforcement-based post-training, Pangu Ultra outperforms leading dense models such as Llama 405B and Mistral Large 2 on multiple benchmarks, and achieves results competitive with DeepSeek-R1, a sparse model with many more parameters. The work demonstrates that purely dense architectures remain competitive and that large-scale LLM training is feasible on Ascend NPU infrastructure.

📝 Abstract
We present Pangu Ultra, a Large Language Model (LLM) with 135 billion parameters and dense Transformer modules trained on Ascend Neural Processing Units (NPUs). Although the field has witnessed unprecedented advances in pushing the scale and capability of LLMs in recent years, training such a large-scale model still involves significant optimization and system challenges. To stabilize the training process, we propose depth-scaled sandwich normalization, which effectively eliminates loss spikes during the training of deep models. We pre-train our model on 13.2 trillion diverse and high-quality tokens and further enhance its reasoning capabilities during post-training. To perform such large-scale training efficiently, we utilize 8,192 Ascend NPUs with a series of system optimizations. Evaluations on multiple diverse benchmarks indicate that Pangu Ultra significantly advances the state-of-the-art capabilities of dense LLMs such as Llama 405B and Mistral Large 2, and even achieves competitive results with DeepSeek-R1, whose sparse model structure contains many more parameters. Our exploration demonstrates that Ascend NPUs are capable of efficiently and effectively training dense models with more than 100 billion parameters. Our model and system will be available for our commercial customers.
Problem

Research questions and friction points this paper is trying to address.

Optimizing training stability for large-scale dense LLMs
Enhancing reasoning capabilities in post-training phases
Efficiently utilizing Ascend NPUs for hundred-billion-parameter model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses depth-scaled sandwich normalization for training stability (see the sketch after this list)
Trains on 13.2 trillion diverse, high-quality tokens
Scales training across 8,192 Ascend NPUs via system-level optimizations
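The normalization scheme is the paper's key stability idea: each sublayer is wrapped with normalization both before and after it, with depth-aware initialization of the normalization gains. Below is a minimal PyTorch sketch of a sandwich-normalized Transformer block, written only from the description above; the RMSNorm module, the block layout, and the 1/sqrt(2L) post-norm gain initialization are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a sandwich-normalized Transformer block.
# Assumption: post-sublayer norm gains start at 1/sqrt(2 * num_layers);
# the paper's actual depth-scaling rule may differ.
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, dim: int, gain_init: float = 1.0, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.full((dim,), gain_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the reciprocal root-mean-square, then by the learned gain.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight


class SandwichBlock(nn.Module):
    """One Transformer block with RMSNorm before AND after each sublayer."""

    def __init__(self, dim: int, num_heads: int, num_layers: int):
        super().__init__()
        # Hypothetical depth-dependent initialization for the post-norm gains.
        post_gain = (2 * num_layers) ** -0.5
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.pre_attn_norm = RMSNorm(dim)
        self.post_attn_norm = RMSNorm(dim, gain_init=post_gain)
        self.pre_ffn_norm = RMSNorm(dim)
        self.post_ffn_norm = RMSNorm(dim, gain_init=post_gain)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.pre_attn_norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + self.post_attn_norm(attn_out)  # normalized residual branch
        x = x + self.post_ffn_norm(self.ffn(self.pre_ffn_norm(x)))
        return x


block = SandwichBlock(dim=512, num_heads=8, num_layers=64)  # arbitrary demo depth
x = torch.randn(2, 16, 512)  # (batch, seq, dim)
print(block(x).shape)        # torch.Size([2, 16, 512])
```

In this layout the post-sublayer norms keep each residual branch's contribution small at initialization, which is one intuition for why sandwich-style normalization damps loss spikes in very deep stacks.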
👥 Authors
Yichun Yin
Noah's Ark Lab, Huawei
LLM
Wenyong Huang
Pangu Team, Huawei
Kaikai Song
Pangu Team, Huawei
Yehui Tang
Shanghai Jiao Tong University
Machine Learning, Quantum AI & AI4Science
Xueyu Wu
The University of Hong Kong
Distributed ML Systems, Federated Learning
Wei Guo
Pangu Team, Huawei
Peng Guo
Pangu Team, Huawei
Yaoyuan Wang
Pangu Team, Huawei
Xiaojun Meng
Noah's Ark Lab, Huawei / Ph.D. @ National University of Singapore / Bachelor @ Tsinghua University
Big Model, NLP, Multimodal, HCI
Yasheng Wang
Tencent
Natural Language Processing
Dong Li
Pangu Team, Huawei
Can Chen
Pangu Team, Huawei
Dandan Tu
Pangu Team, Huawei
Yin Li
Pangu Team, Huawei
Fisher Yu
Pangu Team, Huawei
Ruiming Tang
Pangu Team, Huawei
Yunhe Wang
Noah's Ark Lab, Huawei Technologies
Deep Learning, Language Model, Machine Learning, Computer Vision
Baojun Wang
Huawei Noah’s Ark Lab
NLP
Bin Wang
Pangu Team, Huawei
Bo Wang
Pangu Team, Huawei
Boxiao Liu
Pangu Team, Huawei
Changzheng Zhang
Pangu Team, Huawei
Duyu Tang
Huawei
Natural Language Processing
Fei Mi
Huawei Noah's Ark Lab
LLM Post-Training
Hui Jin
Pangu Team, Huawei
Jiansheng Wei
Pangu Team, Huawei
Jiarui Qin
Tencent
Large Language Model, Recommender Systems, Information Retrieval
Jinpeng Li
Pangu Team, Huawei
Jun Zhao
Pangu Team, Huawei
Liqun Deng
Pangu Team, Huawei
Lin Li
Pangu Team, Huawei
Minghui Xu
Pangu Team, Huawei
Naifu Zhang
Pangu Team, Huawei
Nianzu Zheng
Pangu Team, Huawei
Qiang Li
Pangu Team, Huawei
Rongju Ruan
Pangu Team, Huawei
Shengjun Cheng
Pangu Team, Huawei
Tianyu Guo
Pangu Team, Huawei
Wei He
Pangu Team, Huawei
Wei Li
Pangu Team, Huawei
Weiwen Liu
Associate Professor, Shanghai Jiao Tong University
large language models, AI agents, recommender systems
Wulong Liu
Unknown affiliation
Reinforcement Learning, Autonomous Driving, Robotics, AI Infra, EDA
Xinyi Dai
Noah's Ark Lab, Huawei
Information Retrieval, Recommender System, Large Language Models
Yonghan Dong
Pangu Team, Huawei
Yu Pan
Pangu Team, Huawei
Yue Li
Pangu Team, Huawei
Yufei Wang
Pangu Team, Huawei
Yujun Li
Northwestern Polytechnical University
composite preforming, composite mechanics, resin flow, composite curing
Yunsheng Ni
Pangu Team, Huawei
Zhe Liu
Pangu Team, Huawei
Zhenhe Zhang
Pangu Team, Huawei
Zhicheng Liu
Pangu Team, Huawei