Block Rotation is All You Need for MXFP4 Quantization

πŸ“… 2025-11-06
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing rotation-based quantization methods suffer severe accuracy degradation under MXFP4, a new FP4 format, because they are fundamentally incompatible with its power-of-two block-wise scaling mechanism. Method: We propose a block-level rotation strategy designed specifically for MXFP4 that jointly optimizes weight and activation rotation and scaling at the block level. Guided by an energy-redistribution analysis, the method handles outliers robustly without altering the MXFP4 format definition or introducing computational or memory overhead. Contribution/Results: This work presents the first high-accuracy adaptation of rotation-based techniques to MXFP4. Extensive experiments across multiple mainstream large language models demonstrate substantial improvements in post-training quantization accuracy. Moreover, we establish the first systematic MXFP4 quantization benchmark, offering a new paradigm for efficient low-bit floating-point quantization.
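
The summary leans on MXFP4's power-of-two, block-wise scaling. The sketch below is a minimal illustration of that mechanism under the usual MX conventions (32-element blocks, E2M1 element values, an E8M0 power-of-two shared scale); it is an assumption-based illustration, not the paper's implementation.

```python
# Minimal sketch of MXFP4-style block quantization (illustrative; not the paper's code).
# Assumptions: 32-element blocks, E2M1 element values, E8M0 (power-of-two) shared scale.
import numpy as np

# Representable magnitudes of an E2M1 (FP4) element: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(x, block_size=32):
    """Fake-quantize a 1-D vector block by block with a shared power-of-two scale."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for start in range(0, x.size, block_size):
        blk = x[start:start + block_size]
        amax = np.abs(blk).max()
        # E8M0-style scale: a pure power of two derived from the block maximum.
        scale = 2.0 ** (np.floor(np.log2(amax)) - 2) if amax > 0 else 1.0
        mag = np.abs(blk) / scale
        # Round each scaled magnitude to the nearest FP4 grid point (large values clip to 6).
        idx = np.abs(mag[:, None] - FP4_GRID[None, :]).argmin(axis=1)
        out[start:start + block_size] = np.sign(blk) * FP4_GRID[idx] * scale
    return out

# One outlier dominates the shared block scale, flattening the small values around it.
rng = np.random.default_rng(0)
x = rng.standard_normal(32) * 0.1
x[0] = 8.0
print("max abs error:", np.abs(x - quantize_mxfp4_block(x)).max())
```

Because the shared scale is set by the largest magnitude in the block, a single outlier pushes the remaining small values toward zero; this outlier sensitivity is what rotation-based methods are meant to mitigate.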

πŸ“ Abstract
Large language models (LLMs) have achieved remarkable success, but their rapidly growing scale imposes prohibitive costs in memory, computation, and energy. Post-training quantization (PTQ) is a promising solution for efficient deployment, yet achieving accurate W4A4 quantization remains an open challenge. While most existing methods are designed for INT4 formats, the emergence of MXFP4 -- a new FP4 format with hardware support from multiple vendors (NVIDIA, AMD, Intel) -- raises questions about the applicability of current techniques. In this work, we establish a comprehensive benchmark of PTQ methods under the MXFP4 format. Through systematic evaluation, we find that methods like GPTQ consistently deliver strong performance, whereas rotation-based approaches, which are used by almost all state-of-the-art methods, suffer from severe incompatibility with MXFP4. We further provide the first in-depth analysis of this conflict, tracing its root to a fundamental mismatch between MXFP4's PoT (power-of-two) block scaling and the redistribution of outlier energy via global rotation. Building on this insight, we propose a simple yet effective block rotation strategy that adapts rotation-based methods to MXFP4, leading to substantial accuracy improvements across diverse LLMs. Our findings not only offer clear guidance for practitioners but also lay a foundation for advancing PTQ research under emerging low-precision formats.
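
The root cause named in the abstract, that a global rotation redistributes outlier energy into every block and thereby inflates every block's power-of-two scale, can be seen numerically. The sketch below compares a global rotation with a per-block (block-diagonal) one; random orthogonal matrices stand in for Hadamard-style rotations and the block size of 32 is assumed, so this illustrates the argument rather than the paper's method.

```python
# Illustrative comparison of global vs. block-level rotation (not the paper's code).
# Assumption: random orthogonal matrices stand in for Hadamard-style rotations.
import numpy as np

def random_rotation(n, rng):
    """Random orthogonal matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

rng = np.random.default_rng(0)
block, n_blocks = 32, 4
d = block * n_blocks

x = rng.standard_normal(d) * 0.1
x[5] = 20.0  # a single activation outlier

# Global rotation: mixes all coordinates, so every block inherits outlier energy.
x_global = random_rotation(d, rng) @ x

# Block rotation: one independent rotation per 32-element block.
x_block = x.copy()
for b in range(n_blocks):
    sl = slice(b * block, (b + 1) * block)
    x_block[sl] = random_rotation(block, rng) @ x[sl]

def block_amax(v):
    """Per-block maximum magnitude, which determines each block's PoT scale."""
    return np.abs(v.reshape(n_blocks, block)).max(axis=1)

print("per-block max, original:       ", block_amax(x).round(2))
print("per-block max, global rotation: ", block_amax(x_global).round(2))
print("per-block max, block rotation:  ", block_amax(x_block).round(2))
```

Under the global rotation every block's maximum, and hence its shared power-of-two scale, grows; under the block rotation only the outlier's own block changes, which is the behavior the proposed block rotation exploits.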
Problem

Research questions and friction points this paper is trying to address.

Evaluating post-training quantization methods for MXFP4 format compatibility
Analyzing rotation-based methods' incompatibility with MXFP4 block scaling
Proposing a block rotation strategy to adapt quantization to the MXFP4 format
Innovation

Methods, ideas, or system contributions that make the work stand out.

Block rotation adapts rotation-based methods to MXFP4 (see the folding sketch below)
Strategy addresses MXFP4's power-of-two block scaling
Improves accuracy of low-precision quantization for LLMs
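
Since a block rotation is a block-diagonal orthogonal transform, it can be folded into adjacent weights offline, which is consistent with the summary's claim of no extra computational or memory overhead. The sketch below is a generic check of the folding identity W x = (W R)(Rᵀ x) under assumed shapes; it is not the paper's pipeline.

```python
# Sketch: a block-diagonal rotation folded into the weights leaves a layer's output
# unchanged, so the rotation itself adds no inference cost. Shapes and the Hadamard
# construction are assumptions for illustration; this is not the paper's pipeline.
import numpy as np
from scipy.linalg import hadamard

block, n_blocks = 32, 4
d = block * n_blocks

H = hadamard(block) / np.sqrt(block)   # orthogonal 32x32 Hadamard rotation
R = np.kron(np.eye(n_blocks), H)       # block-diagonal rotation over the full dimension

rng = np.random.default_rng(0)
W = rng.standard_normal((64, d))       # weights of a hypothetical linear layer
x = rng.standard_normal(d)             # its input activation

W_rot = W @ R                          # rotation folded into the weights offline
x_rot = R.T @ x                        # matching rotation applied to the activation
print(np.allclose(W @ x, W_rot @ x_rot))   # True: the layer's computation is unchanged
```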
πŸ”Ž Similar Papers
No similar papers found.
Authors
Yuantian Shao (Nanjing University of Science and Technology)
Peisong Wang (CASIA, Deep Neural Network Acceleration and Compression)
Yuanteng Chen (C2DL, Institute of Automation, Chinese Academy of Sciences)
Chang Xu (School of Computer Science, University of Sydney)
Zhihui Wei (Nanjing University of Science and Technology)
Jian Cheng (C2DL, Institute of Automation, Chinese Academy of Sciences)