🤖 AI Summary
This work addresses the challenge of balancing accuracy and efficiency in low-bit quantization for language model inference. To this end, the authors propose HiFloat4 (HiF4), a block floating-point representation tailored for deep learning that shares a 32-bit, three-level scaling factor across every 64 elements. This design preserves a high dynamic range while enabling efficient fixed-point matrix operations, combining hardware-friendly implementation with strong representational capacity. HiF4 significantly outperforms the existing NVFP4 format in numerical fidelity. Experimental results demonstrate that HiF4 consistently achieves higher average accuracy across diverse downstream tasks when applied to prominent language models, including LLaMA, Qwen, and Mistral.
📝 Abstract
This paper introduces HiFloat4 (HiF4), a block floating-point data format tailored for deep learning. Each HiF4 unit packs 64 four-bit elements together with 32 bits of shared scaling metadata, averaging 4.5 bits per value. The metadata specifies a three-level scaling hierarchy that captures both inter- and intra-group dynamic range while improving utilization of the representational space. In addition, the large 64-element group size allows matrix multiplications to be executed largely in fixed-point arithmetic, significantly reducing hardware area and power consumption. To evaluate the proposed format, we conducted inference experiments on several language models, including LLaMA, Qwen, Mistral, DeepSeek-V3.1, and LongCat. Results show that HiF4 achieves higher average accuracy than the state-of-the-art NVFP4 format across multiple models and diverse downstream tasks.
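The bit budget and block-scaling idea described above can be sketched in a few lines. The following is a minimal illustration, not the authors' method: it checks the 4.5-bits-per-value arithmetic (64 × 4 element bits + 32 metadata bits, amortized over 64 values) and implements a toy *single-level* block quantizer with one shared power-of-two scale per 64-element block. The actual HiF4 metadata encodes a three-level scaling hierarchy whose exact layout is not given in the abstract, so the `quantize_block` scale derivation here is an assumption for illustration only.

```python
import numpy as np

BLOCK = 64       # elements per HiF4 block (from the abstract)
ELEM_BITS = 4    # bits per quantized element
META_BITS = 32   # shared scaling metadata per block


def bits_per_value():
    # (64 * 4 + 32) / 64 = 4.5 average bits per value
    return (BLOCK * ELEM_BITS + META_BITS) / BLOCK


def quantize_block(x):
    """Toy single-level block quantizer: one shared power-of-two
    scale per 64-element block, elements stored as signed 4-bit
    integers in [-8, 7]. HiF4's real three-level hierarchy is NOT
    modeled here."""
    assert x.size == BLOCK
    max_abs = float(np.max(np.abs(x)))
    if max_abs == 0.0:
        return np.zeros(BLOCK, dtype=np.int8), 0
    # choose an exponent so the scaled magnitudes fit in [-7, 7]
    exp = int(np.ceil(np.log2(max_abs / 7.0)))
    scale = 2.0 ** exp
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, exp


def dequantize_block(q, exp):
    # reconstruct approximate values from integers and the shared scale
    return q.astype(np.float64) * (2.0 ** exp)
```

Because every element in a block shares one scale, the inner products of a matrix multiply can run on the 4-bit integers, with the scales applied once per 64-element block; this is the source of the fixed-point efficiency the abstract refers to.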