The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the instability of large language models during FP4 (W4A4G4) quantization-aware training, which stems from severe anisotropy causing dynamic range explosion and numerical instability. The authors identify, for the first time, a pervasive rank-one mean bias—present across layers and training stages—as the root cause of this instability in low-bit training. To mitigate its effect, they propose subtracting the mean at the source prior to quantization, a lightweight reduction operation that effectively suppresses dynamic range expansion and achieves stability comparable to SVD-based methods while drastically reducing hardware overhead. Experimental results demonstrate that this mean removal significantly narrows the loss gap between FP4 and BF16 training and restores downstream task performance, offering a practical pathway toward efficient and stable ultra-low-bit training of large models.

📝 Abstract
Large language models trained on natural language exhibit pronounced anisotropy: a small number of directions concentrate disproportionate energy, while the remaining dimensions form a broad semantic tail. In low-bit training regimes, this geometry becomes numerically unstable. Because blockwise quantization scales are determined by extreme elementwise magnitudes, dominant directions stretch the dynamic range, compressing long-tail semantic variation into narrow numerical bins. We show that this instability is primarily driven by a coherent rank-one mean bias, which constitutes the dominant component of spectral anisotropy in LLM representations. This mean component emerges systematically across layers and training stages and accounts for the majority of extreme activation magnitudes, making it the principal driver of dynamic-range inflation under low precision. Crucially, because the dominant instability is rank-one, it can be eliminated through a simple source-level mean-subtraction operation. This bias-centric conditioning recovers most of the stability benefits of SVD-based spectral methods while requiring only reduction operations and standard quantization kernels. Empirical results on FP4 (W4A4G4) training show that mean removal substantially narrows the loss gap to BF16 and restores downstream performance, providing a hardware-efficient path to stable low-bit LLM training.
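The abstract's core claim can be illustrated with a toy sketch: when blockwise quantization scales are set by extreme elementwise magnitudes, a shared mean offset inflates every block's dynamic range, and subtracting that mean at the source before quantizing (then adding it back) shrinks the quantization error. This is a minimal illustration under assumptions, not the paper's implementation: it uses symmetric uniform 4-bit-style quantization as a stand-in for FP4, and synthetic Gaussian activations with an artificially large mean component.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_blockwise(x, block=32, levels=7):
    """Symmetric uniform 4-bit-style blockwise quantization (a stand-in
    for FP4): each block's scale is set by its max absolute element."""
    xq = np.empty_like(x)
    for i in range(0, x.size, block):
        b = x[i:i + block]
        scale = np.abs(b).max() / levels
        if scale == 0.0:
            scale = 1.0
        xq[i:i + block] = np.round(b / scale) * scale
    return xq

# Synthetic anisotropic activations: a broad semantic tail plus a
# coherent rank-one mean bias that inflates the dynamic range.
tail = rng.normal(0.0, 1.0, 4096)
x = tail + 8.0  # hypothetical large shared mean component

err_raw = np.abs(quantize_blockwise(x) - x).mean()

# Source-level mean subtraction before quantization, in the spirit of
# the paper's bias-centric conditioning; the mean is restored after.
mu = x.mean()
err_centered = np.abs((quantize_blockwise(x - mu) + mu) - x).mean()

print(f"mean abs quantization error, raw:      {err_raw:.4f}")
print(f"mean abs quantization error, centered: {err_centered:.4f}")
```

With the mean component removed, each block's max magnitude (and hence its scale) drops, so the long-tail variation lands in finer numerical bins and the centered error comes out well below the raw error.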
Problem

Research questions and friction points this paper is trying to address.

mean bias
low-bit quantization
LLM training
numerical instability
spectral anisotropy
Innovation

Methods, ideas, or system contributions that make the work stand out.

mean bias
low-bit quantization
anisotropy
FP4 training
rank-one correction