Elucidating the Design Space of FP4 Training

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three key challenges in FP4 training: fragmented design space, ambiguous computational overhead, and unclear stability–efficiency trade-offs. We propose the first systematic quantized gradient analysis framework, unifying the interplay of Hadamard transforms, tensor scaling, and stochastic rounding in forward/backward passes via theoretical modeling and large-scale simulation—thereby characterizing their joint impact on numerical stability and hardware throughput. Building on these insights, we devise a lightweight micro-scaling strategy and introduce a customized FP4 format, UE5M3, which significantly reduces computational cost while maintaining controlled accuracy degradation. Extensive evaluation across regression, image classification, diffusion models, and language models demonstrates that our approach improves training stability by up to 2.1× and hardware throughput by 37% over state-of-the-art methods—achieving, for the first time, Pareto-optimal balance among accuracy, stability, and hardware efficiency in FP4 training.
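The micro-scaling and stochastic-rounding ingredients described above can be sketched in a few lines. This is an illustrative NumPy sketch, not the paper's implementation: the E2M1 magnitude grid, the block size, and max-based per-block scaling are common FP4 conventions assumed here.

```python
import numpy as np

# FP4 (E2M1) magnitude grid: {0, 0.5, 1, 1.5, 2, 3, 4, 6}. This is the common
# E2M1 layout, assumed for illustration; the paper may use a different variant.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_stochastic(x, block_size=32, rng=None):
    """Quantize a 1-D array to per-block-scaled FP4 with stochastic rounding.

    Each block is scaled so its max magnitude maps onto the top FP4 code, then
    every value is rounded up or down to a neighboring grid point with
    probability proportional to its distance -- an unbiased rounding scheme.
    Returns the dequantized (code * scale) values.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for start in range(0, x.size, block_size):
        block = x[start:start + block_size]
        amax = np.max(np.abs(block))
        scale = amax / FP4_GRID[-1] if amax > 0 else 1.0
        mag = np.abs(block) / scale
        # index of the grid point at or below each magnitude
        idx = np.clip(np.searchsorted(FP4_GRID, mag, side="right") - 1,
                      0, len(FP4_GRID) - 2)
        lo, hi = FP4_GRID[idx], FP4_GRID[idx + 1]
        p = (mag - lo) / (hi - lo)                  # P(round up)
        q = np.where(rng.random(mag.shape) < p, hi, lo)
        out[start:start + block_size] = np.sign(block) * q * scale
    return out
```

Because the expected value of each rounded entry equals the input, stochastic rounding keeps quantized gradients unbiased, which is the property that matters for the backward pass.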

📝 Abstract
The increasing computational demands of foundation models have spurred research into low-precision training, with 4-bit floating-point (FP4) formats emerging as a frontier for maximizing hardware throughput. While numerous techniques have been proposed to stabilize FP4 training, they often present isolated solutions with varying, and not always clear, computational overheads. This paper aims to provide a unified view of the design space of FP4 training. We introduce a comprehensive quantization-gradient framework for microscaling quantization that allows for a theoretical analysis of the computational costs associated with different stabilization methods on both the forward and backward passes. Using a simulator built on this framework, we conduct an extensive empirical study across a wide range of machine learning tasks, including regression, image classification, diffusion models, and language models. By systematically evaluating thousands of combinations of techniques, such as novel gradient approximations, rounding strategies, and scaling methods, we identify which configurations offer the most favorable performance-to-overhead trade-off. We find that the techniques enabling the best trade-off involve carefully combining Hadamard transformations, tensor scaling, and stochastic rounding. We further find that using UE5M3 as a scaling-factor format potentially offers a good compromise between range and precision with manageable computational overhead.
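For intuition about the UE5M3 scale format (unsigned, 5 exponent bits, 3 mantissa bits, 8 bits total), here is a minimal decode/encode sketch. The exponent bias of 15 and IEEE-style subnormal handling are assumptions chosen for illustration; the paper's exact bit-level encoding may differ.

```python
def ue5m3_decode(code):
    """Decode an 8-bit UE5M3 value: unsigned, 5 exponent bits, 3 mantissa bits.

    Bias 15 and subnormal handling are standard IEEE-style assumptions made
    here for illustration; the paper's custom format may define them otherwise.
    """
    e = (code >> 3) & 0x1F          # 5-bit exponent field
    m = code & 0x07                 # 3-bit mantissa field
    if e == 0:                      # subnormal: no implicit leading 1
        return (m / 8.0) * 2.0 ** -14
    return (1.0 + m / 8.0) * 2.0 ** (e - 15)

def ue5m3_encode(x):
    """Round-to-nearest encode via brute-force search over all 256 codes."""
    return min(range(256), key=lambda c: abs(ue5m3_decode(c) - x))
```

Under these assumptions the format spans roughly 2^-17 up to 1.875 * 2^16 with 3 bits of relative precision, which is why an E5M3 split is a plausible range-versus-precision compromise for block scale factors.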
Problem

Research questions and friction points this paper is trying to address.

Unifying the design space of FP4 low-precision training techniques
Analyzing computational costs of stabilization methods theoretically and empirically
Identifying optimal FP4 configurations for performance-overhead trade-off
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework for microscaling quantization gradient analysis
Simulator evaluating thousands of technique combinations
Optimal trade-off with Hadamard transformations and scaling
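The Hadamard transformation listed above is typically used to spread outlier values across a block before quantization. A minimal sketch, assuming the usual randomized-Hadamard recipe (Sylvester construction plus random sign flips) rather than the paper's specific variant:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n-by-n Hadamard matrix (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def randomized_hadamard(x, seed=0):
    """Apply a randomized Hadamard rotation (H D x) / sqrt(n).

    The random sign flip D plus the orthonormal Hadamard rotation spreads an
    outlier coordinate across the whole block, so per-block max scaling wastes
    fewer FP4 codes on a single spike. This construction is the standard
    randomized-Hadamard recipe, assumed here, not taken from the paper.
    """
    n = x.shape[-1]
    H = hadamard(n) / np.sqrt(n)    # orthonormal, so norms are preserved
    d = np.random.default_rng(seed).choice([-1.0, 1.0], size=n)
    return (x * d) @ H.T
```

Since the rotation is orthonormal, it is exactly invertible and preserves norms, so it can be folded into the forward and backward passes without changing the computed result, only the quantization error.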