Dissecting Outlier Dynamics in LLM NVFP4 Pretraining

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the persistent performance gap between NVFP4 quantized training and BF16, which stems from outliers induced by NVFP4's limited dynamic range. Through a longitudinal analysis of outlier locations, origins, and temporal evolution across model architectures, the study reveals consistent patterns in modules such as Softmax Attention, Linear Attention, and SwiGLU, and observes that outliers evolve from transient early spikes into persistent "hot channels" in later training stages. To mitigate this, the authors propose Hot-Channel Patch (HCP), an online compensation mechanism, within a broader training recipe called CHON that integrates outlier tracking, hot-channel identification, hardware-efficient residual reinjection, and post-QK operation protection. Evaluated on a GLA-1.3B model, the approach reduces the training loss gap between NVFP4 and BF16 from 0.94% to 0.58% while preserving downstream task accuracy.
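The hot-channel compensation idea described above can be sketched in a few lines. This is a minimal simulation, not the paper's hardware kernels: the EMA-based tracker, the simulated FP4 rounding, and all names (`fake_fp4_quantize`, `HotChannelTracker`, `hot_channel_patch`) are illustrative assumptions.

```python
import numpy as np

def fake_fp4_quantize(x, block=16):
    # Simulated FP4 (E2M1) block quantization: scale each 16-element block
    # so its max magnitude maps to 6 (the largest E2M1 value), then round
    # each element to the nearest representable FP4 value.
    grid = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
    grid = np.concatenate([-grid[::-1], grid])
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / 6.0
    scale[scale == 0] = 1.0
    q = grid[np.abs(xb[..., None] / scale[..., None] - grid).argmin(-1)]
    return (q * scale).reshape(x.shape)

class HotChannelTracker:
    # Track per-channel max magnitude with an EMA; flag the top-k channels
    # as "hot", mirroring the paper's finding that late-stage outliers
    # concentrate in a small, persistent set of channels.
    def __init__(self, n_channels, k=4, momentum=0.99):
        self.ema = np.zeros(n_channels)
        self.k, self.momentum = k, momentum

    def update(self, act):  # act: [tokens, channels]
        amax = np.abs(act).max(axis=0)
        self.ema = self.momentum * self.ema + (1 - self.momentum) * amax
        hot = np.zeros(self.ema.size, dtype=bool)
        hot[np.argsort(self.ema)[-self.k:]] = True
        return hot

def hot_channel_patch(act, hot_mask):
    # Quantize everything to FP4, then reinject the high-precision
    # residual only for the identified hot channels.
    q = fake_fp4_quantize(act)
    return q + (act - q) * hot_mask
```

The key point the sketch captures is that only a few channels ever carry a residual, so the extra high-precision traffic stays small relative to the FP4 tensor.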

📝 Abstract
Training large language models with 4-bit arithmetic improves throughput and memory efficiency. Yet the limited dynamic range of FP4 increases sensitivity to outliers. While NVFP4 mitigates quantization error via hierarchical microscaling, a persistent loss gap remains relative to BF16. This study conducts a longitudinal analysis of outlier dynamics across architectures during NVFP4 pretraining, focusing on where outliers localize, why they occur, and how they evolve over time. We find that, compared with Softmax Attention (SA), Linear Attention (LA) reduces per-tensor heavy tails but still exhibits persistent block-level spikes under block quantization. Our analysis attributes outliers to specific architectural components: Softmax in SA, gating in LA, and SwiGLU in the FFN, with "post-QK" operations exhibiting higher sensitivity to quantization. Notably, outliers evolve from transient spikes early in training into a small set of persistent hot channels (i.e., channels with persistently large magnitudes) in later stages. Based on these findings, we introduce Hot-Channel Patch (HCP), an online compensation mechanism that identifies hot channels and reinjects their residuals using hardware-efficient kernels. We then develop CHON, an NVFP4 training recipe integrating HCP with post-QK operation protection. On a GLA-1.3B model trained for 60B tokens, CHON reduces the loss gap to BF16 from 0.94% to 0.58% while maintaining downstream accuracy.
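The abstract's contrast between per-tensor heavy tails and block-level spikes can be made concrete with a simple diagnostic: the per-block ratio of max magnitude to RMS. This is an illustrative sketch, not the paper's methodology (the function name, the block size of 16, and the planted-spike setup are assumptions); the intuition is that a block dominated by a single large value forces the shared FP4 scale up and crushes the remaining elements toward zero.

```python
import numpy as np

def block_spike_ratio(x, block=16):
    # Per-block amax / RMS. Under block quantization, a ratio near
    # sqrt(block) means one element dominates the block, so the shared
    # scale spends most of FP4's dynamic range on that single element.
    xb = np.abs(x).reshape(-1, block)
    rms = np.sqrt((xb ** 2).mean(axis=1)) + 1e-12
    return xb.max(axis=1) / rms

# A smooth Gaussian tensor vs. the same tensor with sparse planted spikes:
rng = np.random.default_rng(0)
smooth = rng.normal(size=4096)
spiky = smooth.copy()
spiky[::256] = 100.0  # one spike every 16 blocks
```

The ratio saturates at sqrt(block) = 4 for block size 16, so blocks containing a planted spike sit near 4 while typical Gaussian blocks land noticeably lower; this is why block-level spikes can hurt NVFP4 even when per-tensor statistics look moderate.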
Problem

Research questions and friction points this paper is trying to address.

outlier
LLM pretraining
FP4 quantization
quantization error
dynamic range
Innovation

Methods, ideas, or system contributions that make the work stand out.

outlier dynamics
NVFP4
hot-channel patch
4-bit training
quantization-aware training