Benchmarking Post-Training Quantization of Large Language Models under Microscaling Floating Point Formats

📅 2026-01-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the unexplored applicability of existing post-training quantization (PTQ) methods—primarily designed for integer formats—to microscaling floating-point (MXFP) representations. The authors systematically evaluate seven PTQ algorithms across three major families of large language models on 15 benchmarks, revealing for the first time that MXFP8 enables near-lossless compression, whereas MXFP4 suffers significant accuracy degradation. They further demonstrate that quantization sensitivity is predominantly governed by model architecture and propose a pre-scaling optimization strategy that effectively mitigates MXFP4 quantization errors. This work establishes practical guidelines for efficient PTQ tailored to MXFP formats, validates the viability of MXFP8 for deployment, and substantially improves MXFP4 performance, offering actionable insights for low-bit floating-point quantization in real-world applications.

📝 Abstract
Microscaling Floating-Point (MXFP) has emerged as a promising low-precision format for large language models (LLMs). Despite various post-training quantization (PTQ) algorithms being proposed, they mostly focus on integer quantization, while their applicability and behavior under MXFP formats remain largely unexplored. To address this gap, this work conducts a systematic investigation of PTQ under MXFP formats, encompassing over 7 PTQ algorithms, 15 evaluation benchmarks, and 3 LLM families. The key findings include: 1) MXFP8 consistently achieves near-lossless performance, while MXFP4 introduces substantial accuracy degradation and remains challenging; 2) PTQ effectiveness under MXFP depends strongly on format compatibility, with some algorithmic paradigms being consistently more effective than others; 3) PTQ performance exhibits highly consistent trends across model families and modalities, in particular, quantization sensitivity is dominated by the language model rather than the vision encoder in multimodal LLMs; 4) The scaling factor of quantization is a critical error source in MXFP4, and a simple pre-scale optimization strategy can significantly mitigate its impact. Together, these results provide practical guidance on adapting existing PTQ methods to MXFP quantization.
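The abstract's fourth finding — that the shared scaling factor is a dominant error source in MXFP4 and that a pre-scale search mitigates it — can be illustrated with a minimal NumPy sketch. This is not the authors' code: the E2M1 element grid and the 32-element block size follow the MX format convention, while `prescale_search` and its candidate range are a hypothetical stand-in for the paper's pre-scale optimization, shown only to convey the idea of picking the block scale by reconstruction error rather than by the block maximum.

```python
import numpy as np

# Representable magnitudes of the FP4 (E2M1) element format used by MXFP4.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_mxfp4(x, scale):
    """Quantize one block to signed FP4 values under a shared scale."""
    scaled = x / scale
    # Round each element to the nearest representable FP4 magnitude, keeping sign;
    # values beyond 6.0 saturate to the largest grid point.
    cand = np.sign(scaled)[:, None] * FP4_GRID[None, :]
    idx = np.abs(scaled[:, None] - cand).argmin(axis=1)
    return cand[np.arange(len(x)), idx] * scale

def max_based_scale(x):
    """Baseline: power-of-two scale that keeps the block max within the grid."""
    amax = np.abs(x).max()
    return 2.0 ** np.floor(np.log2(amax / FP4_GRID[-1])) if amax > 0 else 1.0

def prescale_search(x, n_candidates=8):
    """Toy pre-scale optimization (assumed form): try a few power-of-two
    scales around the max-based one and keep the lowest-MSE choice."""
    base = max_based_scale(x)
    best_scale, best_err = base, np.inf
    for shift in range(n_candidates):
        s = base * 2.0 ** (shift - n_candidates // 2)  # hypothetical search range
        err = np.mean((x - quantize_block_mxfp4(x, s)) ** 2)
        if err < best_err:
            best_scale, best_err = s, err
    return best_scale

rng = np.random.default_rng(0)
block = rng.normal(size=32).astype(np.float32)  # MX blocks group 32 elements
mse_max = np.mean((block - quantize_block_mxfp4(block, max_based_scale(block))) ** 2)
mse_opt = np.mean((block - quantize_block_mxfp4(block, prescale_search(block))) ** 2)
```

Because the max-based scale is itself one of the search candidates, the optimized scale can only match or reduce the block's reconstruction error, which is the intuition behind treating the scale, not the 4-bit elements, as the main lever for MXFP4 accuracy.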
Problem

Research questions and friction points this paper is trying to address.

Post-Training Quantization
Large Language Models
Microscaling Floating-Point
Low-Precision Formats
Quantization Accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Microscaling Floating-Point
Post-Training Quantization
Large Language Models
MXFP8
Quantization Scaling