🤖 AI Summary
Low-bit (e.g., int2/int4) quantization of large language models sharply degrades quality and often forces practitioners to maintain multiple quantized model variants. To address this, the paper proposes Matryoshka Quantization (MatQuant), a multi-scale, nested quantization framework. MatQuant co-trains a single set of shared weights with cross-precision knowledge distillation, quantizing the feed-forward network (FFN) weights layer-wise so that one model supports dynamic, on-the-fly inference at int2, int4, and int8. Crucially, it is the first work to formally exploit the nesting of integer bit-widths as a hierarchical quantization space, enabling seamless precision switching without retraining or model reloading. Experiments show that an int2 FFN-quantized Gemma-2 9B model is up to 10% more accurate on key tasks than the same model quantized with standard quantization-aware training (QAT), and even outperforms an int8 FFN-quantized Gemma-2 2B model, substantially easing the accuracy-efficiency trade-off in LLM deployment.
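To make the co-training idea concrete, below is a minimal PyTorch-style sketch of a multi-precision objective: the same shared weights produce logits at each bit-width, every precision slice gets a task loss, and lower-precision slices are additionally distilled from the highest-precision slice. The function name `multi_precision_loss`, the uniform loss weights, and the choice of the int8 slice as the distillation teacher are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def multi_precision_loss(logits_by_bits, labels, loss_weights=None):
    """Hypothetical co-training objective over nested precisions.

    logits_by_bits: dict mapping bit-width (e.g. 8, 4, 2) to logits produced
                    by the *same* shared weights sliced to that precision.
    labels:         integer class targets, shape (batch,).
    """
    loss_weights = loss_weights or {b: 1.0 for b in logits_by_bits}
    bits = sorted(logits_by_bits, reverse=True)       # e.g. [8, 4, 2]
    teacher = logits_by_bits[bits[0]].detach()        # highest precision acts as teacher
    total = torch.zeros((), device=teacher.device)
    for b in bits:
        logits = logits_by_bits[b]
        task = F.cross_entropy(logits, labels)        # task loss at this precision
        distill = F.kl_div(                           # cross-precision distillation
            F.log_softmax(logits, dim=-1),
            F.softmax(teacher, dim=-1),
            reduction="batchmean",
        )
        total = total + loss_weights[b] * (task + distill)
    return total
```

In practice the per-precision terms could be weighted differently, and the abstract notes the idea is applied on top of techniques like QAT or OmniQuant; this sketch only illustrates the QAT-style co-training with distillation.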
📝 Abstract
Quantizing model weights is critical for reducing the communication and inference costs of large models. However, quantizing models -- especially to low precisions like int4 or int2 -- requires a trade-off in model quality; int2, in particular, is known to severely degrade model quality. Consequently, practitioners are often forced to maintain multiple models with different quantization levels or serve a single model that best satisfies the quality-latency trade-off. On the other hand, integer data types, such as int8, inherently possess a nested (Matryoshka) structure where smaller bit-width integers, like int4 or int2, are nested within the most significant bits. This paper proposes Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique that addresses the challenge of needing multiple quantized models. It allows training and maintaining just one model, which can then be served at different precision levels. Furthermore, due to the co-training and co-distillation regularization provided by MatQuant, the int2 precision models extracted by MatQuant can be up to 10% more accurate than standard int2 quantization (using techniques like QAT or OmniQuant). This represents significant progress in model quantization, demonstrated by the fact that, with the same recipe, an int2 FFN-quantized Gemma-2 9B model is more accurate than an int8 FFN-quantized Gemma-2 2B model.
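As an illustration of the nested (Matryoshka) structure the abstract refers to, the NumPy sketch below quantizes weights to unsigned 8-bit codes and recovers int4/int2 models by keeping only the most significant bits of those same codes. The min-max quantizer, the helper names (`quantize_uint8`, `slice_msbs`, `dequantize`), and the half-bucket dequantization offset are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def quantize_uint8(w):
    """Simple min-max quantization of a weight tensor to unsigned 8-bit codes."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 255.0
    codes = np.clip(np.round((w - lo) / scale), 0, 255).astype(np.uint8)
    return codes, scale, lo

def slice_msbs(codes, bits):
    """Keep only the `bits` most significant bits: the int4/int2 models are
    nested inside the int8 codes, so no separate model needs to be stored."""
    return codes >> (8 - bits)

def dequantize(sliced, bits, scale, lo):
    """Map sliced codes back to floats by expanding them onto the 8-bit grid
    (re-centred in the truncated bucket) and undoing the min-max scaling."""
    shift = 8 - bits
    centred = sliced.astype(np.float32) * (1 << shift) + ((1 << shift) - 1) / 2.0
    return centred * scale + lo

# One set of int8 codes serves int8, int4, and int2 requests.
w = np.random.randn(4, 4).astype(np.float32)
codes, scale, lo = quantize_uint8(w)
for b in (8, 4, 2):
    w_hat = dequantize(slice_msbs(codes, b), b, scale, lo)
    print(f"int{b}: mean abs error = {np.abs(w - w_hat).mean():.4f}")
```

The reconstruction error grows as fewer bits are kept, which is the quality-latency trade-off the single MatQuant model lets a deployment navigate at serving time.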