🤖 AI Summary
This work addresses a longstanding challenge in large language model quantization: balancing high compression ratios, inference acceleration, and performance preservation. The authors propose a statistically lossless quantization framework that distinguishes task-lossless from distribution-lossless quantization. They define an interpretable fidelity metric, the Expected Acceptance Rate (EAR), and prove a gamma-squared variance law: symmetric quantization inflates noise variance by a factor of gamma squared relative to asymmetric quantization, which makes asymmetry necessary for distribution-lossless fidelity but not for task-level preservation. Using SLQ, a layer-wise non-uniform method that combines asymmetric quantization with a wide bitwidth search, together with optimized inference kernels, the method achieves task-lossless compression at as low as 3.3 bits per parameter (depending on the model) and distribution-lossless compression at 5 to 6 bits per parameter on average, delivering a 1.7 to 3.6× speedup over FP16 inference.
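One way to see why a squared factor can arise between symmetric and asymmetric quantization is that uniform rounding noise has variance proportional to the square of the grid step. The sketch below is an illustration under stated assumptions, not the paper's derivation. This page does not define gamma, so the code assumes gamma to be the ratio of the symmetric step to the asymmetric step. The helpers `quantize_uniform` and `noise_variance_ratio` are hypothetical names, and the symmetric grid is a plain min-max grid over `[-max|w|, max|w|]`.

```python
import numpy as np


def quantize_uniform(w, bits, symmetric):
    """Round w onto a uniform grid with 2**bits points (illustrative only)."""
    levels = 2 ** bits - 1
    if symmetric:
        # Symmetric grid: range is forced to be centered on zero.
        m = np.abs(w).max()
        lo, hi = -m, m
    else:
        # Asymmetric grid: range hugs the actual min and max of the weights.
        lo, hi = w.min(), w.max()
    step = (hi - lo) / levels
    w_hat = lo + np.round((w - lo) / step) * step
    return w_hat, step


def noise_variance_ratio(w, bits):
    """Return (assumed gamma, measured symmetric/asymmetric noise variance ratio)."""
    w_sym, step_sym = quantize_uniform(w, bits, symmetric=True)
    w_asym, step_asym = quantize_uniform(w, bits, symmetric=False)
    gamma = step_sym / step_asym  # assumption: gamma taken as the step ratio
    var_sym = np.mean((w - w_sym) ** 2)
    var_asym = np.mean((w - w_asym) ** 2)
    return gamma, var_sym / var_asym


# Synthetic weights with a nonzero mean, so the symmetric grid wastes range.
rng = np.random.default_rng(0)
weights = rng.normal(loc=1.0, scale=1.0, size=100_000)
gamma, ratio = noise_variance_ratio(weights, bits=8)
print(f"gamma={gamma:.3f}  gamma^2={gamma**2:.3f}  measured ratio={ratio:.3f}")
```

On off-center weights like these, the symmetric grid must cover a wider range than the data occupies, so its step is larger and the measured variance ratio tracks the squared step ratio closely.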
📝 Abstract
Model quantization has become essential for efficient large language model deployment, yet existing approaches involve clear trade-offs: methods such as GPTQ and AWQ achieve practical compression but are lossy, while lossless techniques preserve fidelity but typically do not accelerate inference. This paper explores the middle ground of statistically lossless compression through three complementary notions of losslessness for quantized LLMs. First, task-lossless compression preserves zero-shot benchmark accuracy within natural sampling variance and remains achievable at aggressive bitwidths. Second, we formalize the stricter notion of distribution-lossless compression, requiring the quantized model's next-token distribution to be practically indistinguishable from the original, and propose the Expected Acceptance Rate (EAR), the maximum token-agreement probability under optimal coupling, as a directly interpretable fidelity metric (for example, EAR >= 0.99 indicates 99% agreement). Third, we prove a gamma-squared variance law showing that symmetric quantization inflates noise variance by gamma squared relative to asymmetric quantization, making asymmetry necessary for distribution-lossless fidelity but not for task-level preservation. Using SLQ, a layer-wise non-uniform method with asymmetric quantization and wide bitwidth search, we achieve task-lossless compression at well below 4 bits per parameter (as low as 3.3 bits depending on the model), distribution-lossless compression at 5 to 6 bits per parameter on average, and inference speedups of 1.7 to 3.6x relative to FP16 with optimized kernels. Source code is available at https://github.com/IST-DASLab/SLQ.
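The EAR definition above can be made concrete with a standard fact about couplings: for two discrete distributions p and q, the maximum probability that coupled samples agree is the sum over tokens of min(p, q), which equals one minus the total variation distance. The following is a minimal sketch under that reading, not the authors' implementation. The function name `expected_acceptance_rate`, the plain average over positions, and the `(positions, vocab)` logit layout are assumptions made for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def expected_acceptance_rate(ref_logits, quant_logits):
    """Average maximal-coupling agreement between two next-token distributions.

    Both inputs have shape (positions, vocab). For each position, the best
    achievable agreement probability is sum_i min(p_i, q_i) = 1 - TV(p, q).
    """
    p = softmax(np.asarray(ref_logits, dtype=np.float64))
    q = softmax(np.asarray(quant_logits, dtype=np.float64))
    per_position = np.minimum(p, q).sum(axis=-1)
    return float(per_position.mean())


# Toy check: the second "model" is the first with small logit noise added.
rng = np.random.default_rng(0)
ref = rng.normal(size=(16, 1000))
quant = ref + 0.01 * rng.normal(size=ref.shape)
print(expected_acceptance_rate(ref, quant))
```

Identical logits give a value of 1, and the value falls as the two next-token distributions drift apart, which is what makes a threshold such as EAR >= 0.99 directly readable as an agreement rate.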