Optimal Formats for Weight Quantisation

๐Ÿ“… 2025-05-19
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the lack of systematic theoretical guidance in the design of weight quantisation formats for deep learning. We propose a framework that models format design as minimising the KL divergence between quantised and full-precision model outputs, an objective shown to be equivalent to minimising the parameter-wise squared error. We establish, for the first time, rigorous connections between quantisation formats, the Fisher information matrix, and classical rate-distortion theory. Leveraging these, we derive a variable-length entropy-coded format that is provably optimal in squared error, and formulate a Fisher-information-based strategy for allocating bit-widths across the model's parameter tensors. Experiments demonstrate that the variable-length format significantly outperforms fixed-length baselines, that block quantisation and sparse outlier formats also implicitly benefit from variable-length encoding, and that on language models the method saves up to 0.25 bits per parameter.
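The gain claimed for variable-length codes can be illustrated with a small sketch (ours, not the paper's code): uniformly quantise Gaussian-distributed weights, then compare the fixed-length cost per parameter with the empirical entropy of the bin indices, which a lossless entropy coder (e.g. arithmetic coding) can approach.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(100_000)  # stand-in for a Gaussian weight tensor

# Uniform quantisation to 4-bit indices (16 bins spanning +/- 4 sigma).
bits = 4
lo, hi = -4.0, 4.0
step = (hi - lo) / (2 ** bits)
idx = np.clip(np.floor((weights - lo) / step), 0, 2 ** bits - 1).astype(int)

# Fixed-length storage costs `bits` per parameter. The entropy-coded cost is
# bounded below by the empirical entropy of the index distribution, which is
# far from uniform for Gaussian weights.
p = np.bincount(idx, minlength=2 ** bits) / idx.size
entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))

print(f"fixed-length: {bits} bits/param, entropy-coded: {entropy:.2f} bits/param")
```

Because most weights fall in the central bins, the entropy comes out well below 4 bits per parameter, which is the kind of saving the fixed-length baselines leave on the table.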


๐Ÿ“ Abstract
Weight quantisation is an essential technique for enabling efficient training and deployment of modern deep learning models. However, the recipe book of quantisation formats is large and the formats are often chosen empirically. In this paper, we propose a framework for systematic design and analysis of quantisation formats. By connecting the question of format design with the classical quantisation theory, we show that the strong practical performance of popular formats comes from their ability to represent values using variable-length codes. Framing the optimisation problem as minimising the KL divergence between the original and quantised model outputs, the objective is aligned with minimising the squared quantisation error of the model parameters. We therefore develop and evaluate squared-error-optimal formats for known distributions, observing significant improvement of variable-length codes over fixed-length codes. Uniform quantisation followed by lossless compression with a variable-length code is shown to be optimal. However, we find that commonly used block formats and sparse outlier formats also outperform fixed-length codes, implying they also exploit variable-length encoding. Finally, by using the relationship between the Fisher information and KL divergence, we derive the optimal allocation of bit-widths to individual parameter tensors across the model's layers, saving up to 0.25 bits per parameter when tested with direct-cast quantisation of language models.
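The step from the KL objective to squared parameter error is the standard second-order expansion of the KL divergence around the full-precision parameters (notation ours, not necessarily the paper's):

```latex
D_{\mathrm{KL}}\!\left(p_{\theta} \,\|\, p_{\theta+\delta}\right)
  \approx \tfrac{1}{2}\,\delta^{\top} F(\theta)\,\delta,
```

where $F(\theta)$ is the Fisher information matrix and $\delta$ is the quantisation perturbation. Under a diagonal approximation of $F$ this reduces to a Fisher-weighted sum of squared errors, $\tfrac{1}{2}\sum_i F_{ii}\,\delta_i^2$, which is why minimising squared quantisation error, with per-tensor weights derived from the Fisher information, tracks the KL objective.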
Problem

Research questions and friction points this paper is trying to address.

Systematic design of optimal weight quantisation formats
Minimising KL divergence for quantised model outputs
Optimal bit-width allocation across model layers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Variable-length codes for weight quantisation
Squared-error-optimal format design
Optimal bit-width allocation via Fisher information
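As a concrete illustration of the bit-width allocation idea, here is a minimal sketch using the classical reverse-water-filling rule from rate-distortion theory (our assumption; the paper's exact procedure may differ): each tensor receives the average budget plus half the log-ratio of its Fisher-weighted variance to the geometric mean of all tensors' weights.

```python
import numpy as np

def allocate_bits(fisher_weighted_var, avg_bits):
    """Classical rate-distortion bit allocation (a sketch): tensors whose
    Fisher-weighted variance exceeds the geometric mean get more than
    avg_bits, the rest get fewer, and the average budget is preserved."""
    w = np.asarray(fisher_weighted_var, dtype=float)
    geo_mean = np.exp(np.mean(np.log(w)))
    return avg_bits + 0.5 * np.log2(w / geo_mean)

# Hypothetical per-tensor Fisher-weighted variances for a 4-tensor model.
w = [1e-4, 4e-4, 1e-3, 2.5e-3]
bit_widths = allocate_bits(w, avg_bits=4.0)
print(np.round(bit_widths, 2))  # mean stays at 4.0 bits/param
```

The sensitive tensors (large Fisher-weighted variance) are pushed above the 4-bit average while insensitive ones drop below it, which is the mechanism behind the reported saving of up to 0.25 bits per parameter.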
๐Ÿ”Ž Similar Papers
No similar papers found.