🤖 AI Summary
Existing uniform quantization schemes for large language models (LLMs) overlook inter-layer sensitivity variations, leading to significant accuracy degradation. To address this, the paper proposes a layer-sensitive mixed-precision quantization method that jointly models activation sensitivity and weight-distribution kurtosis to characterize per-layer quantization difficulty. Two complementary mechanisms, SensiBoost (activation-aware precision allocation) and KurtBoost (kurtosis-driven weight quantization configuration), enable dynamic bit-width assignment across layers. Evaluated on the LLaMA family, the approach achieves up to a 9% reduction in perplexity under 4-bit quantization with only a 2% increase in memory overhead, substantially improving low-bit quantization fidelity while preserving computational efficiency.
📝 Abstract
Large Vision and Language Models have exhibited remarkable human-like intelligence in tasks such as natural language comprehension, problem-solving, logical reasoning, and knowledge retrieval. However, training and serving these models require substantial computational resources, posing a significant barrier to their widespread application and further research. To mitigate this challenge, various model compression techniques have been developed to reduce computational requirements. Nevertheless, existing methods often employ uniform quantization configurations, failing to account for the varying difficulty of quantizing different layers of a large neural network. This paper tackles this issue by leveraging layer-sensitivity features, namely activation sensitivity and weight-distribution kurtosis, to identify layers that are challenging to quantize accurately and to allocate them additional memory budget. The proposed methods, SensiBoost and KurtBoost, exploit activation sensitivity and kurtosis respectively, and demonstrate notable improvements in quantization accuracy, achieving up to 9% lower perplexity with only a 2% increase in memory budget on LLaMA models compared to the baseline.
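The kurtosis-guided allocation described above can be sketched in a few lines: compute the excess kurtosis of each layer's weights and grant a higher bit-width to the heaviest-tailed (hardest-to-quantize) layers. This is an illustrative sketch, not the paper's implementation; the function names, the `top_k` selection rule, and the 4-bit/8-bit choice are assumptions for the example.

```python
import numpy as np

def excess_kurtosis(w: np.ndarray) -> float:
    """Excess kurtosis of a weight tensor (0 for a Gaussian)."""
    w = w.ravel()
    m = w.mean()
    var = ((w - m) ** 2).mean()
    m4 = ((w - m) ** 4).mean()
    return float(m4 / var**2 - 3.0)

def allocate_bits(layers: dict, base_bits: int = 4,
                  boost_bits: int = 8, top_k: int = 2) -> dict:
    """Assign boost_bits to the top_k highest-kurtosis layers,
    base_bits to the rest (hypothetical allocation rule)."""
    scores = {name: excess_kurtosis(w) for name, w in layers.items()}
    hardest = set(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return {name: (boost_bits if name in hardest else base_bits)
            for name in scores}

# Example: a heavy-tailed (Laplace) layer gets the extra bits.
rng = np.random.default_rng(0)
layers = {
    "layer0": rng.normal(size=10_000),   # Gaussian weights, kurtosis ~ 0
    "layer1": rng.laplace(size=10_000),  # heavy-tailed, kurtosis ~ 3
    "layer2": rng.normal(size=10_000),
}
bits = allocate_bits(layers, top_k=1)
```

Here the Laplace-distributed layer dominates the kurtosis ranking and receives the boosted precision, mirroring the paper's intuition that outlier-heavy weight distributions need more bits.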