🤖 AI Summary
To address training instability and low system efficiency in ultra-large dense large language models (LLMs), this paper proposes Depth-Scaled Sandwich Normalization, a novel normalization technique that effectively suppresses loss spikes when training deep models. Leveraging Ascend NPU clusters (8,192 accelerators), the authors construct Pangu Ultra, a 135-billion-parameter dense Transformer model, and achieve efficient, scalable training of a hundred-billion-parameter-scale model on domestic hardware. Trained on 13.2 trillion high-quality tokens with pretraining and reinforcement-based post-training, Pangu Ultra surpasses prior dense models, including Llama 405B and Mistral Large 2, across multiple benchmarks, while matching the performance of larger sparse models such as DeepSeek-R1. This work demonstrates the competitiveness and feasibility of purely dense architectures on indigenous AI infrastructure, advancing scalable LLM training in resource-constrained, domestically supported environments.
📝 Abstract
We present Pangu Ultra, a Large Language Model (LLM) with 135 billion parameters and dense Transformer modules trained on Ascend Neural Processing Units (NPUs). Although the field of LLMs has witnessed unprecedented advances in scale and capability in recent years, training such a large-scale model still involves significant optimization and system challenges. To stabilize the training process, we propose depth-scaled sandwich normalization, which effectively eliminates loss spikes during the training of deep models. We pre-train our model on 13.2 trillion diverse and high-quality tokens and further enhance its reasoning capabilities during post-training. To perform such large-scale training efficiently, we utilize 8,192 Ascend NPUs with a series of system optimizations. Evaluations on multiple diverse benchmarks indicate that Pangu Ultra significantly advances the state-of-the-art capabilities of dense LLMs such as Llama 405B and Mistral Large 2, and even achieves competitive results with DeepSeek-R1, whose sparse architecture contains many more parameters. Our exploration demonstrates that Ascend NPUs are capable of efficiently and effectively training dense models with more than 100 billion parameters. Our model and system will be available to our commercial customers.
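To make the stabilization idea concrete, here is a minimal PyTorch sketch of a "sandwich"-normalized Transformer sub-layer: normalization is applied both before and after the sub-layer, inside the residual branch. The depth-dependent initialization of the post-norm gain shown here (`num_layers ** -0.5`) is a hypothetical illustration of "depth-scaled" behavior, not the paper's exact formula; the `RMSNorm` and `SandwichBlock` names are likewise ours.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square layer normalization, common in modern LLMs."""

    def __init__(self, dim: int, init_gain: float = 1.0, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.full((dim,), init_gain))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight


class SandwichBlock(nn.Module):
    """One residual branch with sandwich normalization:
    x + post_norm(sublayer(pre_norm(x))).

    The post-norm gain is initialized smaller for deeper networks
    (here 1/sqrt(num_layers), an assumed scaling rule) so that each
    layer's contribution shrinks with depth at initialization,
    damping the loss spikes that plague very deep training.
    """

    def __init__(self, dim: int, sublayer: nn.Module, num_layers: int):
        super().__init__()
        self.pre_norm = RMSNorm(dim)
        self.post_norm = RMSNorm(dim, init_gain=num_layers ** -0.5)
        self.sublayer = sublayer  # e.g. attention or feed-forward module

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.post_norm(self.sublayer(self.pre_norm(x)))
```

In a full model, each of the 94 layers of a Pangu-Ultra-scale network would wrap both its attention and feed-forward sub-layers this way, with the depth scaling applied per layer (again, the per-layer details here are an assumption for illustration).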