HiFloat4 Format for Language Model Pre-training on Ascend NPUs

📅 2026-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational and memory costs of large language model pretraining by systematically evaluating and validating, for the first time, the feasibility of full FP4-precision pretraining with the HiFloat4 format on Huawei Ascend NPU clusters. The study covers both dense and Mixture-of-Experts (MoE) architectures, with linear layers and expert GEMM operations executed entirely in FP4. By introducing FP4-aware numerical stabilization techniques, the approach achieves up to a fourfold improvement in compute throughput and memory efficiency while keeping relative error within 1% of full-precision baselines. Experiments show that Pangu- and LLaMA-style models, in both dense and MoE variants, trained with this method perform nearly on par with full-precision baselines, significantly advancing the practicality of FP4 for large-scale pretraining.
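The summary does not detail the paper's "FP4-aware numerical stabilization techniques," but one standard ingredient in low-precision training is stochastic rounding, which makes quantization unbiased in expectation rather than systematically biased toward grid points. A minimal NumPy sketch of this generic technique (illustrative only, not the paper's method; the function name and the E2M1-style grid are assumptions):

```python
import numpy as np

def stochastic_round(x, grid, rng):
    """Round each value to one of its two neighbors on a sorted `grid`,
    choosing the upper neighbor with probability proportional to proximity.
    E[stochastic_round(x)] == x for x within the grid's range (unbiased)."""
    grid = np.asarray(grid, dtype=np.float64)
    x = np.clip(np.asarray(x, dtype=np.float64), grid[0], grid[-1])
    hi = np.searchsorted(grid, x, side="left")   # index of upper neighbor
    hi = np.clip(hi, 1, len(grid) - 1)
    lo = hi - 1
    span = grid[hi] - grid[lo]
    p_up = np.where(span > 0, (x - grid[lo]) / span, 0.0)
    take_upper = rng.random(x.shape) < p_up
    return np.where(take_upper, grid[hi], grid[lo])
```

Averaged over many samples, rounding 0.25 on a grid containing 0.0 and 0.5 yields a mean near 0.25, which is what prevents small gradient updates from being silently lost in repeated 4-bit rounding.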

📝 Abstract
Large foundation models have become central to modern machine learning, with performance scaling predictably with model size and data. However, training and deploying such models incur substantial computational and memory costs, motivating the development of low-precision training techniques. Recent work has demonstrated that 4-bit floating-point (FP4) formats, such as MXFP4 and NVFP4, can be successfully applied to linear GEMM operations in large language models (LLMs), achieving up to 4x improvements in compute throughput and memory efficiency compared to higher-precision baselines. In this work, we investigate the recently proposed HiFloat4 FP4 format for Huawei Ascend NPUs and systematically compare it with MXFP4 in large-scale training settings. All experiments are conducted on Ascend NPU clusters, with linear and expert GEMM operations performed entirely in FP4 precision. We evaluate both dense architectures (e.g., Pangu and LLaMA-style models) and mixture-of-experts (MoE) models, where both standard linear layers and expert-specific GEMMs operate in FP4. Furthermore, we explore stabilization techniques tailored to FP4 training that significantly reduce numerical degradation, maintaining relative error within 1% of full-precision baselines while preserving the efficiency benefits of 4-bit computation. Our results provide a comprehensive empirical study of FP4 training on NPUs and highlight the practical trade-offs between FP4 formats in large-scale dense and MoE models.
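The MXFP4 format mentioned in the abstract pairs a shared power-of-two scale per block of elements with 4-bit E2M1 values (representable magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6). A minimal NumPy sketch of this style of block quantization (a simulation for intuition only, not the paper's HiFloat4 kernel; the block size of 32 and round-to-nearest choice follow the MX convention but are assumptions here):

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 element format (sign stored separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4(x, block=32):
    """Simulate MXFP4-style quantize-dequantize: each block of `block` values
    shares one power-of-two scale; scaled elements snap to the E2M1 grid."""
    x = np.asarray(x, dtype=np.float64)
    pad = (-len(x)) % block
    blocks = np.pad(x, (0, pad)).reshape(-1, block)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    safe = np.where(amax > 0, amax, 1.0)
    # Smallest power-of-two scale that brings the block max within the grid (<= 6.0).
    scale = 2.0 ** np.ceil(np.log2(safe / FP4_GRID[-1]))
    scaled = blocks / scale
    # Round each scaled magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    deq = np.sign(scaled) * FP4_GRID[idx] * scale
    return deq.reshape(-1)[:len(x)]
```

Values that land exactly on a scaled grid point round-trip losslessly, while typical activations incur a bounded relative error; in real FP4 GEMM kernels the 4-bit codes and block scales are what the hardware actually stores and multiplies.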
Problem

Research questions and friction points this paper is trying to address.

FP4
low-precision training
large language models
numerical stability
Ascend NPU
Innovation

Methods, ideas, or system contributions that make the work stand out.

HiFloat4
FP4 training
Ascend NPU
low-precision computing
mixture-of-experts