Grid Games: The Power of Multiple Grids for Quantizing Large Language Models

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limited representational capacity of a single fixed floating-point grid in microscaled 4-bit formats such as NVFP4 and MXFP4. It proposes multi-grid quantization: for each group of values sharing a scale, the quantizer selects the better of two or more 4-bit grids and records the choice in one or more bits of the scale value. The paper formalizes this as the power-of-two-grids (PO2) problem and shows theoretically that small-group formats such as MXFP or NVFP can benefit significantly from PO2 grids, while the advantage vanishes for very large groups. It instantiates several grid families, learned or structured: PO2(NF4), MPO2, PO2(Split87), and SFP4. In post-training quantization of standard open models and in pre-training of Llama-like models, these adaptive grids consistently improve accuracy over single-grid FP4 in both weight-only and weight-plus-activation settings.

📝 Abstract

A major recent advance in quantization is given by microscaled 4-bit formats such as NVFP4 and MXFP4, quantizing values into small groups sharing a scale, assuming a fixed floating-point grid. In this paper, we study the following natural extension: assume that, for each group of values, we are free to select the "better" among two or more 4-bit grids marked by one or more bits in the scale value. We formalize the power-of-two-grids (PO2) problem, and provide theoretical results showing that practical small-group formats such as MXFP or NVFP can benefit significantly from PO2 grids, while the advantage vanishes for very large groups. On the practical side, we instantiate several grid families, including 1) PO2(NF4), which pairs the standard NF4 normal grid with a learned grid, 2) MPO2, a grid pair that is fully learned over real weights and activations, 3) PO2(Split87), an explicit-zero asymmetric grid, and 4) SFP4, a TensorCore-implementable triple which pairs NVFP4 with two shifted variants. Results for post-training quantization of standard open models and pre-training of Llama-like models show that adaptive grids consistently improve accuracy vs single-grid FP4 under both weight-only and weight+activation quantization. Source code is available at https://github.com/IST-DASLab/GridGames.
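
The per-group grid selection described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the helper names `quantize_group` and `quantize_multi_grid`, the absmax scaling rule, the squared-error selection criterion, and the second grid `UNIFORM_GRID` are all assumptions made for illustration. Only the FP4 (E2M1) value set is a standard format; the paper's own grids (PO2(NF4), MPO2, PO2(Split87), SFP4) are not reproduced here.

```python
import numpy as np

# FP4 (E2M1) magnitudes; the signed grid has 15 distinct values since +0 == -0.
_FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.unique(np.concatenate([-_FP4_POS, _FP4_POS]))

# Hypothetical second grid (NOT from the paper): 16 uniformly spaced levels.
UNIFORM_GRID = np.linspace(-6.0, 6.0, 16)


def quantize_group(x, grid):
    """Round each value of x to the nearest grid point after absmax scaling."""
    amax = np.max(np.abs(x))
    scale = amax / np.max(np.abs(grid)) if amax > 0 else 1.0
    idx = np.argmin(np.abs(x[:, None] / scale - grid[None, :]), axis=1)
    return grid[idx] * scale


def quantize_multi_grid(x, grids, group_size=16):
    """Per group, keep the candidate grid with the lowest squared error.

    Assumes x is a float array whose size is divisible by group_size.
    Returns the dequantized array and the chosen grid index for each group.
    """
    groups = x.reshape(-1, group_size)
    out = np.empty_like(groups)
    choice = np.empty(len(groups), dtype=np.int64)
    for g, vals in enumerate(groups):
        candidates = [quantize_group(vals, grid) for grid in grids]
        errors = [np.sum((vals - c) ** 2) for c in candidates]
        choice[g] = int(np.argmin(errors))
        out[g] = candidates[choice[g]]
    return out.reshape(x.shape), choice
```

Calling `quantize_multi_grid(x, [FP4_GRID, UNIFORM_GRID])` returns one grid index per group; with two grids and a group size of 16, storing that index costs one bit per group, or 1/16 of a bit per value. Because the single-grid result is always among the candidates, the selected error per group can never exceed the single-grid error under this criterion.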

Problem

Research questions and friction points this paper is trying to address.

quantization

large language models

multiple grids

4-bit formats

floating-point grid

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-grid quantization

adaptive grids

4-bit quantization