HiFloat4 Format for Language Model Inference

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of balancing accuracy and efficiency in low-bit quantization for language model inference. To this end, the authors propose HiFloat4 (HiF4), a block floating-point representation tailored for deep learning that shares 32 bits of three-level scaling metadata across each group of 64 elements. This design preserves a high dynamic range while enabling matrix operations to run largely in fixed-point arithmetic, yielding both a hardware-friendly implementation and strong representational capacity. HiF4 surpasses the existing NVFP4 format in numerical fidelity, and experimental results demonstrate that it consistently achieves higher average accuracy across diverse downstream tasks when applied to prominent language models, including LLaMA, Qwen, and Mistral.

📝 Abstract
This paper introduces HiFloat4 (HiF4), a block floating-point data format tailored for deep learning. Each HiF4 unit packs 64 4-bit elements with 32 bits of shared scaling metadata, averaging 4.5 bits per value. The metadata specifies a three-level scaling hierarchy that captures inter- and intra-group dynamic range while improving utilization of the representational space. In addition, the large 64-element group size allows matrix multiplications to be executed largely in fixed-point arithmetic, significantly reducing hardware area and power consumption. To evaluate the proposed format, we conducted inference experiments on several language models, including LLaMA, Qwen, Mistral, DeepSeek-V3.1, and LongCat. Results show that HiF4 achieves higher average accuracy than the state-of-the-art NVFP4 format across multiple models and diverse downstream tasks.
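The storage cost stated in the abstract can be verified with a short sketch. Note that the abstract does not specify how the 32 metadata bits are laid out, so the split below into a block-wide scale plus per-subgroup scales is a hypothetical illustration of a three-level hierarchy (block → subgroup → element), not the paper's actual encoding.

```python
# Bit-budget sketch for a HiF4-style block: 64 four-bit elements
# share 32 bits of scaling metadata. The metadata layout used in
# dequantize() below is hypothetical, for illustration only.
GROUP_SIZE = 64
ELEM_BITS = 4
META_BITS = 32

def avg_bits_per_value(group_size=GROUP_SIZE,
                       elem_bits=ELEM_BITS,
                       meta_bits=META_BITS):
    """Average storage cost per value, shared metadata included."""
    return (group_size * elem_bits + meta_bits) / group_size

def dequantize(codes, block_scale, sub_scales, sub_size=16):
    """Hypothetical three-level decode: a block-wide scale, a
    per-subgroup scale, and the 4-bit element code itself."""
    assert len(codes) == GROUP_SIZE
    out = []
    for i, c in enumerate(codes):
        s = sub_scales[i // sub_size]       # level 2: subgroup scale
        out.append(block_scale * s * c)     # level 1 * level 2 * level 3
    return out

print(avg_bits_per_value())  # 4.5 bits/value, matching the abstract
```

The 4.5-bit average follows directly: (64 × 4 + 32) / 64 = 4.5. A shared scale over a large 64-element group is also what lets the inner products over the raw 4-bit codes stay in fixed-point, with the floating-point scales applied once per block rather than once per element.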
Problem

Research questions and friction points this paper is trying to address.

low-bit quantization
language model inference
block floating-point
hardware efficiency
data format
Innovation

Methods, ideas, or system contributions that make the work stand out.

HiFloat4
block floating-point
4-bit quantization
matrix multiplication optimization
language model inference