🤖 AI Summary
Existing FP4 training methods suffer from substantial accuracy degradation and rely heavily on mixed-precision fallbacks, preventing the computational and energy-efficiency benefits of FP4 from being realized. This work proposes Quartet, an end-to-end FP4 training approach in which all major computations (e.g., in linear layers) across the forward pass, backward pass, and gradient updates are performed in hardware-supported FP4 floating-point arithmetic, without mixed-precision fallbacks. We design Blackwell-optimized CUDA kernels and a dynamic numerical-range calibration mechanism to ensure numerical stability and convergence under full FP4 precision. Furthermore, we establish a new low-precision scaling law that quantifies the accuracy-vs-computation trade-off across bit-widths and shows Quartet to be near-optimal in this respect. Evaluated on Llama-family models, our method achieves state-of-the-art FP4 training accuracy, successfully trains billion-parameter models, and is competitive with both FP8 and standard-precision (e.g., FP16/BF16) training.
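To make the "all major matmuls in low precision" idea concrete, here is a minimal NumPy sketch of a linear layer whose forward and backward products consume only quantized operands. The `fake_fp4` quantizer below is a crude uniform 15-level stand-in for hardware FP4 (real E2M1 has a nonuniform grid), and the function names are illustrative; this is not Quartet's actual kernel or rounding scheme.

```python
import numpy as np

def fake_fp4(x: np.ndarray) -> np.ndarray:
    """Uniform 15-level (4-bit signed) round-to-nearest quantizer used as
    a stand-in for hardware FP4; real FP4 (E2M1) has a nonuniform grid."""
    s = max(np.abs(x).max(), 1e-12)        # absmax scale, guard against 0
    return np.round(x / s * 7.0) / 7.0 * s

def forward(x, w):
    # Both matmul operands are quantized before the product is taken.
    return fake_fp4(x) @ fake_fp4(w).T

def backward(x, w, grad_out):
    # The backward matmuls likewise consume only quantized operands.
    g = fake_fp4(grad_out)
    grad_x = g @ fake_fp4(w)       # gradient w.r.t. the activations
    grad_w = g.T @ fake_fp4(x)     # gradient w.r.t. the weights
    return grad_x, grad_w
```

In a real FP4 pipeline the quantized tensors would be stored in 4-bit form and multiplied by FP4 tensor cores; here the quantization is merely simulated in float to show where it sits in the computation.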
📝 Abstract
The rapid advancement of large language models (LLMs) has been paralleled by unprecedented increases in computational demands, with training costs for state-of-the-art models doubling every few months. Training models directly in low-precision arithmetic offers a solution by improving both computational throughput and energy efficiency. NVIDIA's recent Blackwell architecture provides hardware support for extremely low-precision operations, specifically FP4 variants, promising substantial efficiency gains. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we systematically investigate hardware-supported FP4 training and introduce Quartet, a new approach enabling accurate, end-to-end FP4 training with all the major computations (e.g., in linear layers) performed in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across varying bit-widths and allows us to identify Quartet as a "near-optimal" low-precision training technique in terms of accuracy-vs-computation. We implement Quartet using optimized CUDA kernels tailored for NVIDIA Blackwell GPUs, and show that it achieves state-of-the-art accuracy for FP4 precision, successfully training billion-scale models. Our results demonstrate that fully FP4-based training is a competitive alternative to standard-precision and FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
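For readers unfamiliar with the FP4 format itself: hardware FP4 (E2M1) represents only eight non-negative magnitudes, so values must be scaled into its range before rounding. The sketch below simulates round-to-nearest E2M1 quantization with per-group absmax scaling, a common building block of FP4 schemes; the group size and the quantizer itself are illustrative assumptions, not the specific recipe used by Quartet.

```python
import numpy as np

# Representable non-negative magnitudes of the E2M1 (FP4) format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x: np.ndarray, group_size: int = 16) -> np.ndarray:
    """Simulated round-to-nearest FP4 quantization with per-group scaling:
    each group's absmax is mapped to the largest representable value (6.0),
    then every entry is snapped to the nearest grid point. Assumes x.size
    is divisible by group_size."""
    flat = x.reshape(-1, group_size).astype(np.float64)
    scale = np.abs(flat).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale[scale == 0] = 1.0  # all-zero groups need no scaling
    scaled = flat / scale
    # Index of the nearest representable magnitude; sign restored after.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    return (np.sign(scaled) * FP4_GRID[idx] * scale).reshape(x.shape)
```

The quantization error introduced by this rounding on weights, activations, and gradients is exactly what makes accurate fully-FP4 training difficult, and what the paper's scaling law quantifies.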