Mixed-Precision Quantization for Language Models: Techniques and Prospects

📅 2025-10-19
🤖 AI Summary
The rapid scaling of large language models (LLMs) drives unsustainable computational, memory, and energy costs, motivating quantization as a core compression technique. This survey consolidates mixed-precision quantization for LLMs: it reviews quantization fundamentals (uniform and non-uniform quantizers, granularity choices, and common post-training quantization methods), establishes a taxonomy of mixed-precision frameworks according to how they allocate bit-widths across weights, activations, and key-value caches, and compares them on perplexity, zero-shot task accuracy, and deployment trade-offs. It further contrasts LLM-era approaches with earlier mixed-precision methods for deep neural networks, identifying which strategies transfer, and outlines open directions in hardware-aware design, activation quantization, and scalable optimization for billion-parameter models.

📝 Abstract
The rapid scaling of language models (LMs) has resulted in unprecedented computational, memory, and energy requirements, making their training and deployment increasingly unsustainable. Quantization has emerged as an essential compression technique to reduce model size, alleviate memory bottlenecks, and accelerate inference. However, while uniform low-bit quantization (e.g., INT8, INT4) provides significant efficiency gains, it can degrade accuracy in sensitive components of transformer-based LMs. Mixed-precision quantization offers a promising alternative by selectively allocating precision across layers or within tensors to balance efficiency and accuracy. This survey provides a comprehensive overview of Mixed-Precision quantization frameworks for LMs (MXPLMs). We first review quantization fundamentals, including uniform and non-uniform quantizers, quantization granularity, and methods widely used in post-training quantization. We then categorize and compare recent MXPLM frameworks according to their bit allocation strategies and precision configurations across weights, activations, and key-value caches. A comparative analysis highlights differences in perplexity, zero-shot task performance, and deployment trade-offs. Furthermore, we contrast MXPLMs with earlier mixed-precision quantization methods for deep neural networks, identifying strategies that transfer and those that face challenges in the LM setting. Finally, we summarize open issues and future directions, including hardware-aware design, activation quantization, and scalable optimization methods for billion-parameter models. By consolidating recent advances, this work serves as a reference for understanding the current landscape and research prospects of mixed-precision quantization for large-scale language models.
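The quantization fundamentals the abstract mentions (uniform quantizers, per-tensor granularity, bit-width trade-offs) can be illustrated with a minimal sketch. This is not code from the paper, just a generic symmetric uniform quantizer with one scale per tensor; the function names are illustrative.

```python
import numpy as np

def quantize_uniform(x: np.ndarray, bits: int = 8):
    """Symmetric uniform quantization with per-tensor granularity:
    one scale maps the float range onto signed integers."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for INT8
    scale = np.max(np.abs(x)) / qmax        # single scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8 if bits == 8 else np.int32), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Round-trip: rounding error is bounded by half the scale step,
# so lower bit-widths (larger scale) mean larger error.
np.random.seed(0)
x = np.random.randn(1024).astype(np.float32)
q8, s8 = quantize_uniform(x, bits=8)
err8 = float(np.max(np.abs(dequantize(q8, s8) - x)))
```

Finer granularities (per-channel or per-group scales) reduce this error at the cost of extra scale metadata, which is one axis along which the surveyed frameworks differ.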
Problem

Research questions and friction points this paper is trying to address.

Reducing computational and memory costs of large language models
Balancing efficiency and accuracy in transformer model quantization
Optimizing mixed-precision allocation across neural network components
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixed-precision quantization balances efficiency and accuracy
Selective bit allocation across layers and tensors
Hardware-aware design for billion-parameter model optimization
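The selective bit allocation idea above can be sketched as a toy greedy policy: keep everything at a low bit-width, then promote the most precision-sensitive components to a higher bit-width until an average bit budget is spent. The layer names and sensitivity scores here are hypothetical, and this greedy rule is only one of many allocation strategies the survey categorizes.

```python
# Hypothetical per-layer sensitivity scores (e.g. perplexity increase when
# that layer alone is quantized to 4 bits); higher = more sensitive.
sensitivity = {"embed": 0.9, "attn.qkv": 0.6, "attn.out": 0.3,
               "mlp.up": 0.2, "mlp.down": 0.5, "lm_head": 0.8}

def allocate_bits(sensitivity, budget_bits=5.0, low=4, high=8):
    """Greedy mixed-precision allocation: start all layers at `low` bits,
    then promote layers to `high` bits in decreasing order of sensitivity
    while the average bit-width stays within `budget_bits`."""
    n = len(sensitivity)
    bits = {name: low for name in sensitivity}
    for name in sorted(sensitivity, key=sensitivity.get, reverse=True):
        new_avg = (sum(bits.values()) - low + high) / n
        if new_avg <= budget_bits:
            bits[name] = high
    return bits

config = allocate_bits(sensitivity)
avg_bits = sum(config.values()) / len(config)
```

Real frameworks replace the hand-set scores with measured sensitivities (e.g. Hessian- or perturbation-based) and account for hardware constraints, but the budgeted-promotion structure is the same.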
Mariam Rakka
University of California Irvine, Irvine, CA, USA
Marios Fournarakis
Wayve AI, London, UK
Olga Krestinskaya
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Jinane Bazzi
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Khaled N. Salama
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Fadi Kurdahi
University of California Irvine, Irvine, CA, USA
Ahmed M. Eltawil
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Mohammed E. Fouda
Unknown affiliation