Adaptive Block-Scaled Data Types

📅 2026-03-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the significant accuracy degradation caused by existing 4-bit quantization formats, such as NVFP4, when representing large-magnitude values. To mitigate this issue, the authors propose a block-wise adaptive mixed-precision quantization scheme that dynamically selects between INT4 and FP4 representations for every group of 16 values, reusing the sign bit of the shared scaling factor to indicate the format type. The approach is generalized to other bit widths, yielding a unified IFx family (e.g., IF3, IF6). Coupled with E4M3 scaling factors and a dedicated IF4 multiply-accumulate unit, the method enables efficient hardware deployment. Experiments demonstrate that IF4 consistently outperforms current 4-bit quantization strategies in both training and post-training settings, substantially reducing language modeling loss and improving accuracy across multiple downstream tasks.
๐Ÿ“ Abstract
NVFP4 has grown increasingly popular as a 4-bit format for quantizing large language models due to its hardware support and its ability to retain useful information with relatively few bits per parameter. However, the format is not without limitations: recent work has shown that NVFP4 suffers from its error distribution, resulting in large amounts of quantization error on near-maximal values in each group of 16 values. In this work, we leverage this insight to design new Adaptive Block-Scaled Data Types that can adapt to the distribution of their input values. For four-bit quantization, our proposed IF4 (Int/Float 4) data type selects between FP4 and INT4 representations for each group of 16 values, which are then scaled by an E4M3 scale factor as is done with NVFP4. The selected data type is denoted using the scale factor's sign bit, which is currently unused in NVFP4, and we apply the same insight to design formats for other bit-widths, including IF3 and IF6. When used to quantize language models, we find that IF4 outperforms existing 4-bit block-scaled formats, achieving lower loss during quantized training and achieving higher accuracy on many tasks in post-training quantization. We additionally design and evaluate an IF4 Multiply-Accumulate (MAC) unit to demonstrate that IF4 can be implemented efficiently in next-generation hardware accelerators. Our code is available at https://github.com/mit-han-lab/fouroversix.
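The abstract describes the core mechanism: for each group of 16 values, pick whichever of INT4 or FP4 reconstructs the group with less error, with the choice denoted by the (otherwise unused) sign bit of the shared E4M3 scale factor. Below is a minimal numerical sketch of that per-group selection, not the paper's implementation: the scale is kept as a plain float (E4M3 scale quantization is omitted), INT4 is modeled as a symmetric sign-magnitude grid for simplicity, and the error metric (per-group MSE) is an assumption. All function names are hypothetical.

```python
import numpy as np

# Representable magnitudes per element format (sign handled separately).
# FP4 (E2M1) positive values as used in NVFP4; INT4 modeled as the uniform
# grid 0..7 (a simplification of true two's-complement INT4).
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
INT4_LEVELS = np.arange(0.0, 8.0)  # 0, 1, ..., 7

def quantize_group(x, levels):
    """Scale a group so its absmax maps to the top level, then round
    each magnitude to the nearest representable level."""
    amax = np.abs(x).max()
    if amax == 0.0:
        return np.zeros_like(x), 1.0
    scale = amax / levels[-1]           # would be stored as E4M3 in hardware
    mags = np.abs(x) / scale
    idx = np.abs(mags[:, None] - levels[None, :]).argmin(axis=1)
    return np.sign(x) * levels[idx] * scale, scale

def if4_quantize(x, group_size=16):
    """Per group of 16, keep whichever of INT4/FP4 gives lower MSE.
    In IF4 the chosen format would be flagged via the scale's sign bit;
    here we just return it as a string per group."""
    out = np.empty_like(x, dtype=np.float64)
    fmts = []
    for i in range(0, len(x), group_size):
        g = x[i:i + group_size].astype(np.float64)
        q_fp4, _ = quantize_group(g, FP4_LEVELS)
        q_int4, _ = quantize_group(g, INT4_LEVELS)
        if np.mean((g - q_int4) ** 2) <= np.mean((g - q_fp4) ** 2):
            out[i:i + group_size], fmt = q_int4, "INT4"
        else:
            out[i:i + group_size], fmt = q_fp4, "FP4"
        fmts.append(fmt)
    return out, fmts
```

The sketch makes the paper's motivation concrete: a group whose values sit near the maximum favors INT4's uniform grid, while a group clustered near zero with a few large entries favors FP4's logarithmically spaced levels, and the selection can differ for every 16-value block.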
Problem

Research questions and friction points this paper is trying to address.

quantization error
NVFP4
large language models
block-scaled data types
4-bit format
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adaptive quantization
Block-scaled data types
IF4
4-bit quantization
Hardware-efficient LLMs