Block Rotation is All You Need for MXFP4 Quantization

πŸ“… 2025-11-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing rotation-based quantization methods suffer severe accuracy degradation under MXFP4, a new FP4 format, because they are fundamentally incompatible with its power-of-two block-wise scaling mechanism. Method: We propose a block-level rotation strategy designed specifically for MXFP4 that jointly optimizes weight and activation rotation and scaling at the block level. Guided by an energy-redistribution analysis, the method handles outliers robustly without altering the MXFP4 format definition or introducing computational or memory overhead. Contribution/Results: This work presents the first high-accuracy adaptation of rotation-based techniques to MXFP4. Extensive experiments across multiple mainstream large language models demonstrate substantial improvements in post-training quantization accuracy. Moreover, we establish the first systematic MXFP4 quantization benchmark, offering a new paradigm for efficient low-bit floating-point quantization.
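
The summary leans on MXFP4's power-of-two, block-wise scaling. The sketch below is a minimal illustration of that mechanism under the usual MX conventions (32-element blocks, E2M1 element values, an E8M0 power-of-two shared scale); it is an assumption-based illustration, not the paper's implementation.

```python
# Minimal sketch of MXFP4-style block quantization (illustrative; not the paper's code).
# Assumptions: 32-element blocks, E2M1 element values, E8M0 (power-of-two) shared scale.
import numpy as np

# Representable magnitudes of an E2M1 (FP4) element: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(x, block_size=32):
    """Fake-quantize a 1-D vector block by block with a shared power-of-two scale."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for start in range(0, x.size, block_size):
        blk = x[start:start + block_size]
        amax = np.abs(blk).max()
        # E8M0-style scale: a pure power of two derived from the block maximum.
        scale = 2.0 ** (np.floor(np.log2(amax)) - 2) if amax > 0 else 1.0
        mag = np.abs(blk) / scale
        # Round each scaled magnitude to the nearest FP4 grid point (large values clip to 6).
        idx = np.abs(mag[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        out[start:start + block_size] = np.sign(blk) * FP4_GRID[idx] * scale
    return out

# One outlier dominates the shared block scale, flattening the small values around it.
rng = np.random.default_rng(0)
x = rng.standard_normal(32) * 0.1
x[0] = 8.0
print("max abs error:", np.abs(x - quantize_mxfp4_block(x)).max())
```

Because the shared scale is set by the largest magnitude in the block, a single outlier pushes the remaining small values toward zero; this outlier sensitivity is what rotation-based methods are meant to mitigate.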

πŸ“ Abstract
Large language models (LLMs) have achieved remarkable success, but their rapidly growing scale imposes prohibitive costs in memory, computation, and energy. Post-training quantization (PTQ) is a promising solution for efficient deployment, yet achieving accurate W4A4 quantization remains an open challenge. While most existing methods are designed for INT4 formats, the emergence of MXFP4 -- a new FP4 format with hardware support from multiple vendors (NVIDIA, AMD, Intel) -- raises questions about the applicability of current techniques. In this work, we establish a comprehensive benchmark of PTQ methods under the MXFP4 format. Through systematic evaluation, we find that methods like GPTQ consistently deliver strong performance, whereas rotation-based approaches, which are used by almost all state-of-the-art methods, suffer from severe incompatibility with MXFP4. We further provide the first in-depth analysis of this conflict, tracing its root to a fundamental mismatch between MXFP4's PoT (power-of-two) block scaling and the redistribution of outlier energy via global rotation. Building on this insight, we propose a simple yet effective block rotation strategy that adapts rotation-based methods to MXFP4, leading to substantial accuracy improvements across diverse LLMs. Our findings not only offer clear guidance for practitioners but also lay a foundation for advancing PTQ research under emerging low-precision formats.
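
The root cause named in the abstract, that a global rotation redistributes outlier energy into every block and thereby inflates every block's power-of-two scale, can be seen numerically. The sketch below compares a global rotation with a per-block (block-diagonal) one; random orthogonal matrices stand in for Hadamard-style rotations and the block size of 32 is assumed, so this illustrates the argument rather than the paper's method.

```python
# Illustrative comparison of global vs. block-level rotation (not the paper's code).
# Assumption: random orthogonal matrices stand in for Hadamard-style rotations.
import numpy as np

def random_rotation(n, rng):
    """Random orthogonal matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

rng = np.random.default_rng(0)
block, n_blocks = 32, 4
d = block * n_blocks

x = rng.standard_normal(d) * 0.1
x[5] = 20.0  # a single activation outlier

# Global rotation: mixes all coordinates, so every block inherits outlier energy.
x_global = random_rotation(d, rng) @ x

# Block rotation: one independent rotation per 32-element block.
x_block = x.copy()
for b in range(n_blocks):
    sl = slice(b * block, (b + 1) * block)
    x_block[sl] = random_rotation(block, rng) @ x[sl]

def block_amax(v):
    """Per-block maximum magnitude, which determines each block's PoT scale."""
    return np.abs(v.reshape(n_blocks, block)).max(axis=1)

print("per-block max, original:       ", block_amax(x).round(2))
print("per-block max, global rotation: ", block_amax(x_global).round(2))
print("per-block max, block rotation:  ", block_amax(x_block).round(2))
```

Under the global rotation every block's maximum, and hence its shared power-of-two scale, grows; under the block rotation only the outlier's own block changes, which is the behavior the proposed block rotation exploits.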
Problem

Research questions and friction points this paper is trying to address.

Evaluating post-training quantization methods for MXFP4 format compatibility
Analyzing rotation-based methods' incompatibility with MXFP4 block scaling
Proposing a block rotation strategy to adapt quantization to the MXFP4 format
Innovation

Methods, ideas, or system contributions that make the work stand out.

Block rotation adapts rotation-based methods to MXFP4 (see the folding sketch below)
Strategy addresses MXFP4's power-of-two block scaling
Improves accuracy of low-precision quantization for LLMs
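
Since a block rotation is a block-diagonal orthogonal transform, it can be folded into adjacent weights offline, which is consistent with the summary's claim of no extra computational or memory overhead. The sketch below is a generic check of the folding identity W x = (W R)(Rᵀ x) under assumed shapes; it is not the paper's pipeline.

```python
# Sketch: a block-diagonal rotation folded into the weights leaves a layer's output
# unchanged, so the rotation itself adds no inference cost. Shapes and the Hadamard
# construction are assumptions for illustration; this is not the paper's pipeline.
import numpy as np
from scipy.linalg import hadamard

block, n_blocks = 32, 4
d = block * n_blocks

H = hadamard(block) / np.sqrt(block)   # orthogonal 32x32 Hadamard rotation
R = np.kron(np.eye(n_blocks), H)       # block-diagonal rotation over the full dimension

rng = np.random.default_rng(0)
W = rng.standard_normal((64, d))       # weights of a hypothetical linear layer
x = rng.standard_normal(d)             # its input activation

W_rot = W @ R                          # rotation folded into the weights offline
x_rot = R.T @ x                        # matching rotation applied to the activation
print(np.allclose(W @ x, W_rot @ x_rot))   # True: the layer's computation is unchanged
```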
πŸ”Ž Similar Papers
No similar papers found.
Authors
Yuantian Shao (Nanjing University of Science and Technology)
Peisong Wang (CASIA, Deep Neural Network Acceleration and Compression)
Yuanteng Chen (C2DL, Institute of Automation, Chinese Academy of Sciences)
Chang Xu (School of Computer Science, University of Sydney)
Zhihui Wei (Nanjing University of Science and Technology)
Jian Cheng (C2DL, Institute of Automation, Chinese Academy of Sciences)