🤖 AI Summary
This work addresses the significant accuracy degradation of existing low-bit microscaling formats such as MXFP4, which is caused by their shared power-of-two scaling factors. The authors propose a novel microscaling quantization format that integrates lightweight metadata through algorithm-hardware co-design, enabling online quantization and efficient encoding with minimal storage overhead. The flexible metadata mechanism also lends itself to lightweight hardware acceleration units. Evaluated on large language model benchmarks, the method reduces accuracy loss by 70.63% on average compared to MXFP4 (and by 37.30% relative to NVFP4), while achieving up to 1.91× speedup and 1.75× higher energy efficiency.
📝 Abstract
Existing low-bit Microscaling (MX) formats, such as MXFP4, often suffer from substantial accuracy degradation due to their use of a shared power-of-two scaling factor. In this work, we explore strategies that introduce minimal metadata to recover accuracy lost during quantization while maintaining high bit efficiency across a wide range of large language models. We propose a complete algorithm-hardware co-design based on flexible metadata, featuring online quantization with a simple encoding scheme. To support the proposed method efficiently, we implement a lightweight hardware unit and integrate it into the accelerator. Evaluation results demonstrate that our method substantially narrows the accuracy gap, achieving on average a 70.63% reduction in accuracy loss compared to MXFP4 and a 37.30% reduction relative to the latest NVFP4 on LLM benchmarks. Furthermore, our design delivers up to 1.91$\times$ speedup and 1.75$\times$ energy savings over state-of-the-art accelerators. Our code is available at https://github.com/SJTU-ReArch-Group/M2XFP_ASPLOS26.
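To illustrate the baseline the paper improves upon, the sketch below shows MX-style block quantization with a single shared power-of-two scale and FP4 (E2M1) elements, following the OCP Microscaling convention. This is a simplified illustration of why a power-of-two shared scale loses accuracy (coarse scale granularity plus possible clipping), not the paper's proposed metadata-augmented method; all names here are illustrative.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 element format used by MXFP4.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_mx(block):
    """Quantize one block with a shared power-of-two scale (MX-style sketch).

    The shared scale is 2^(floor(log2(amax)) - emax_elem), with emax_elem = 2
    for E2M1, as in the OCP MX convention. Rounding the scale to a power of
    two is a main source of the accuracy loss discussed in the abstract.
    """
    amax = np.max(np.abs(block))
    if amax == 0:
        return np.zeros_like(block), 1.0
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)  # power-of-two shared scale
    scaled = block / scale
    # Round each element to the nearest FP4 grid point, preserving sign;
    # magnitudes above 6 implicitly clip to the largest grid value.
    idx = np.argmin(np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]), axis=1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q * scale, scale

rng = np.random.default_rng(0)
block = rng.normal(size=32)           # one 32-element MX block
deq, scale = quantize_block_mx(block)
err = np.abs(deq - block).mean()      # average quantization error
```

Comparing `err` across blocks against a format with a finer (non-power-of-two) scale makes the gap that the proposed metadata aims to close concrete.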