DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization

📅 2026-04-20
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a failure mode of MXFP4 quantization: a single activation outlier inflates the scaling factor shared by its block, which compresses the effective dynamic range of the remaining values and introduces significant quantization error. The paper adapts DuQuant's outlier-aware fine-grained rotation to the MXFP4 format by matching the rotation block size to the microscaling group size (32 elements). Because each group then has its own scaling factor, the cross-block variance that motivated DuQuant's dual rotations and zigzag permutation no longer matters, so a single outlier-aware rotation replaces the whole pipeline while also smoothing the weight distribution. The approach reports state-of-the-art accuracy for MXFP4 W4A4 quantization of LLaMA-3 family models and halves the online rotation cost.
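
The scale inflation described above can be reproduced in a few lines. The sketch below is illustrative only and is not taken from the paper's code. It assumes the OCP Microscaling convention of a power-of-two shared scale, 2^(floor(log2(max|x|)) - 2), and the FP4 E2M1 magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6}; the names mxfp4_quantize_block and mse are hypothetical.

```python
import math

# Representable magnitudes of FP4 E2M1 (sign handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def mxfp4_quantize_block(block):
    """Quantize and dequantize one group that shares a power-of-two scale.

    Assumption: scale = 2 ** (floor(log2(amax)) - 2), where 2 is the largest
    exponent of E2M1. Values are clamped to the largest grid magnitude (6).
    """
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for v in block:
        mag = min(abs(v) / scale, FP4_GRID[-1])
        q = min(FP4_GRID, key=lambda g: abs(g - mag))  # round to nearest grid point
        out.append(math.copysign(q * scale, v))
    return out


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# A group of 32 small values, and the same group with one outlier at index 0.
clean = [0.1 * ((i % 7) - 3) for i in range(32)]
dirty = [50.0] + clean[1:]

q_clean = mxfp4_quantize_block(clean)
q_dirty = mxfp4_quantize_block(dirty)

# Compare the error on the 31 elements that are identical in both groups.
print("error without outlier:", mse(clean[1:], q_clean[1:]))
print("error with outlier:   ", mse(dirty[1:], q_dirty[1:]))
```

With these inputs the outlier raises the shared scale from 2^-4 to 2^3, so every remaining element in the group rounds to zero.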

📝 Abstract
The MXFP4 microscaling format, which partitions tensors into blocks of 32 elements sharing an E8M0 scaling factor, has emerged as a promising substrate for efficient LLM inference, backed by native hardware support on NVIDIA Blackwell Tensor Cores. However, activation outliers pose a unique challenge under this format: a single outlier inflates the shared block scale, compressing the effective dynamic range of the remaining elements and causing significant quantization error. Existing rotation-based remedies, including randomized Hadamard and learnable rotations, are data-agnostic and therefore unable to specifically target the channels where outliers concentrate. We propose DuQuant++, which adapts the outlier-aware fine-grained rotation of DuQuant to the MXFP4 format by aligning the rotation block size with the microscaling group size (B = 32). Because each MXFP4 group possesses an independent scaling factor, the cross-block variance issue that necessitates dual rotations and a zigzag permutation in the original DuQuant becomes irrelevant. DuQuant++ can therefore replace the entire pipeline with a single outlier-aware rotation, which halves the online rotation cost while simultaneously smoothing the weight distribution. Extensive experiments on the LLaMA-3 family under MXFP4 W4A4 quantization show that DuQuant++ consistently achieves state-of-the-art performance. Our code is available at https://github.com/Hsu1023/DuQuant++.
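
To make the block alignment concrete, here is a minimal sketch of a block-diagonal rotation whose block size equals the microscaling group size (B = 32). It does not reproduce DuQuant++'s outlier-aware rotation, which is built from the data; as a plainly labeled stand-in it uses a normalized Sylvester Hadamard matrix, one of the data-agnostic rotations the abstract contrasts against. The helper names hadamard, rotate_blockwise, and dot are hypothetical. The sketch only illustrates two structural properties: rotating activations and weights with the same orthogonal blocks leaves their product unchanged, and an outlier is spread only within its own group of 32.

```python
import math


def hadamard(n):
    """Normalized Sylvester Hadamard matrix; n must be a power of two."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in H]


def rotate_blockwise(x, R, transpose=False):
    """Apply x_block @ R (or x_block @ R^T) to each consecutive group of len(R)."""
    B = len(R)
    assert len(x) % B == 0, "length must be a multiple of the block size"
    out = []
    for start in range(0, len(x), B):
        blk = x[start:start + B]
        for j in range(B):
            if transpose:
                out.append(sum(blk[i] * R[j][i] for i in range(B)))
            else:
                out.append(sum(blk[i] * R[i][j] for i in range(B)))
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


B = 32
R = hadamard(B)

# Two groups of 32 activations; one outlier sits in the first group.
x = [0.1 * ((i % 7) - 3) for i in range(2 * B)]
x[5] = 40.0
# One output channel of a weight matrix, as a vector of the same length.
w = [0.05 * ((i % 5) - 2) for i in range(2 * B)]

x_rot = rotate_blockwise(x, R)   # x R
w_rot = rotate_blockwise(w, R)   # R^T w, written as the row vector w R

print("x . w before rotation:", dot(x, w))
print("x . w after rotation: ", dot(x_rot, w_rot))
print("peak in group 0 before:", max(abs(v) for v in x[:B]))
print("peak in group 0 after: ", max(abs(v) for v in x_rot[:B]))
```

Because the rotation is block-diagonal, the second group of 32 is unaffected by the outlier in the first group, which matches the abstract's point that each MXFP4 group carries its own scaling factor.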
Problem

Research questions and friction points this paper is trying to address.

outlier
quantization error
MXFP4
microscaling
activation outliers
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained rotation
microscaling
outlier-aware quantization
MXFP4
LLM inference