Search Your Block Floating Point Scales!

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high quantization error of traditional block floating-point (BFP) quantization, which sets each block's shared scaling factor from the block's maximum absolute value. To mitigate this limitation, the authors propose ScaleSearch, a method that treats scaling factor selection as a search problem: a fine-grained search that leverages the mantissa bits available in microscaling BFP formats such as NVFP4 to minimize quantization error. ScaleSearch incurs no additional computational overhead and integrates into both post-training quantization (PTQ) pipelines and low-precision attention mechanisms. Experimental results show a 27% reduction in quantization error under NVFP4; PTQ performance of Qwen3-8B improves by up to 15 points on MATH500; and ScaleSearchAttention improves Wikitext-2 perplexity by up to 0.77 points for Llama 3.1 70B, closely matching the original model’s performance.

📝 Abstract

Quantization has emerged as a standard technique for accelerating inference for generative models by enabling faster low-precision computations and reduced memory transfers. Recently, GPU accelerators have added first-class support for microscaling Block Floating Point (BFP) formats. Standard BFP algorithms use a fixed scale based on the maximum magnitude of the block. We observe that this scale choice can be suboptimal with respect to quantization error. In this work, we propose ScaleSearch, an alternative strategy for selecting these scale factors: a fine-grained search that leverages the mantissa bits in microscaling formats to minimize the quantization error for the given distribution. ScaleSearch can be integrated with existing quantization methods such as Post Training Quantization (PTQ) and low-precision attention, and is shown to improve their performance. Additionally, we introduce ScaleSearchAttention, an accelerated NVFP4-based attention algorithm, which uses ScaleSearch and adapted prior techniques to ensure near-zero performance loss for causal language modeling. Experiments show that ScaleSearch reduces quantization error by 27% for NVFP4 and improves language model PTQ by up to 15 points on MATH500 (Qwen3-8B), while ScaleSearchAttention improves Wikitext-2 PPL by up to 0.77 points for Llama 3.1 70B. The proposed methods closely match baseline performance while providing quantization accuracy improvements.
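To make the contrast between a max-based scale and a searched scale concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the names `quantize_block`, `max_scale`, and `search_scale` are hypothetical, the candidate set (multiples of the max-based scale between 0.75 and 1.125) is an arbitrary choice for this example, and the scale is kept in full precision instead of being rounded to the low-precision scale format that the paper's mantissa-bit search targets. Block elements use the FP4 E2M1 value grid.

```python
import random

# Magnitudes representable by a 4-bit E2M1 float (sign is handled separately).
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = FP4_GRID[-1]


def quantize_block(block, scale):
    """Quantize to the FP4 grid with a shared scale, then dequantize."""
    out = []
    for x in block:
        mag = min(abs(x) / scale, FP4_MAX)  # saturate values above the grid
        q = min(FP4_GRID, key=lambda g: abs(g - mag))  # ties go to the smaller value
        out.append((q if x >= 0 else -q) * scale)
    return out


def mse(block, scale):
    """Mean squared error between a block and its quantized reconstruction."""
    deq = quantize_block(block, scale)
    return sum((a - b) ** 2 for a, b in zip(block, deq)) / len(block)


def max_scale(block):
    """Baseline: map the largest magnitude in the block onto the largest grid value."""
    amax = max(abs(x) for x in block)
    return amax / FP4_MAX if amax > 0 else 1.0


def search_scale(block, factors=None):
    """Pick the candidate scale with the lowest MSE (candidate set is an assumption)."""
    if factors is None:
        # 0.75, 0.78125, ..., 1.125; includes 1.0, i.e. the baseline itself.
        factors = [0.75 + i / 32 for i in range(13)]
    base = max_scale(block)
    return min((base * f for f in factors), key=lambda s: mse(block, s))


random.seed(0)
block = [random.gauss(0.0, 1.0) for _ in range(16)]  # one 16-element block
base_err = mse(block, max_scale(block))
best_err = mse(block, search_scale(block))
print(f"max-based MSE: {base_err:.5f}, searched MSE: {best_err:.5f}")
```

Because the candidate set contains the max-based scale, the searched error in this sketch can never exceed the baseline error. Scales below the baseline clip the largest element in exchange for a finer grid on the remaining values, which is the trade-off a scale search can exploit.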

Problem

Research questions and friction points this paper is trying to address.

Block Floating Point

quantization error

scale factor

low-precision computation

generative models

Innovation

Methods, ideas, or system contributions that make the work stand out.

ScaleSearch

Block Floating Point

Quantization Error Minimization