Progressive Mixed-Precision Decoding for Efficient LLM Inference

📅 2024-10-17
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the memory bandwidth bottleneck in LLM inference on resource-constrained devices and the substantial accuracy degradation caused by low-bit quantization, this paper proposes Phase-Aware Progressive Mixed-Precision Decoding (PMPD). Unlike uniform quantization, PMPD introduces stage-differentiated precision allocation: maintaining high precision during prefilling to preserve contextual modeling capability, while dynamically reducing precision along sequence depth during decoding, guided by task- and prompt-aware scheduling. Technically, PMPD integrates mixed-precision quantization, phase-aware computational modeling, and GEMV (General Matrix-Vector multiplication) acceleration. Experiments demonstrate 1.4–12.2× GEMV speedup on NVIDIA GPUs; on LLM-dedicated NPUs, it achieves 3.8–8.0× throughput over fp16 baselines and up to 1.54× higher throughput than uniform quantization—without compromising output quality.
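The core idea — full precision during prefill, then progressively lower bit-widths deeper into the generated sequence — can be sketched as a per-step precision schedule. The specific precision ladder (fp16 → 8-bit → 4-bit → 3-bit) and the fixed switch points below are illustrative assumptions, not the paper's actual schedule:

```python
def pmpd_precision_schedule(num_prompt_tokens, num_decode_steps,
                            ladder=(16, 8, 4, 3), switch_points=(0, 16, 64)):
    """Return the weight bit-width used at each inference step.

    Prefill always runs at the highest precision to preserve contextual
    modeling; decoding steps descend the precision ladder as the
    generated sequence grows deeper (hypothetical switch points).
    """
    schedule = []
    # Prefill phase: one entry per prompt token, full precision.
    schedule.extend([ladder[0]] * num_prompt_tokens)
    # Decoding phase: progressively lower precision along sequence depth.
    for step in range(num_decode_steps):
        level = sum(1 for sp in switch_points if step >= sp)
        schedule.append(ladder[min(level, len(ladder) - 1)])
    return schedule
```

With a 4-token prompt, the first decode steps run at 8-bit, drop to 4-bit at step 16, and to 3-bit at step 64 — capturing the "gradual lowering of precision deeper in the sequence" described above.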

📝 Abstract
In spite of the great potential of large language models (LLMs) across various tasks, their deployment on resource-constrained devices remains challenging due to their excessive computational and memory demands. Quantization has emerged as an effective solution by storing weights in reduced precision. However, utilizing low precisions (i.e. 2/3-bit) to substantially alleviate the memory-boundedness of LLM decoding still suffers from prohibitive performance drop. In this work, we argue that existing approaches fail to explore the diversity in computational patterns, redundancy, and sensitivity to approximations of the different phases of LLM inference, resorting to a uniform quantization policy throughout. Instead, we propose a novel phase-aware method that selectively allocates precision during different phases of LLM inference, achieving both strong context extraction during prefill and efficient memory bandwidth utilization during decoding. To further address the memory-boundedness of the decoding phase, we introduce Progressive Mixed-Precision Decoding (PMPD), a technique that enables the gradual lowering of precision deeper in the generated sequence, together with a spectrum of precision-switching schedulers that dynamically drive the precision-lowering decisions in either task-adaptive or prompt-adaptive manner. Extensive evaluation across diverse language tasks shows that when targeting Nvidia GPUs, PMPD achieves 1.4-12.2× speedup in matrix-vector multiplications over fp16 models, while when targeting an LLM-optimized NPU, our approach delivers a throughput gain of 3.8-8.0× over fp16 models and up to 1.54× over uniform quantization approaches while preserving the output quality.
Problem

Research questions and friction points this paper is trying to address.

LLM decoding on resource-constrained devices is memory-bandwidth-bound, making fp16 inference impractically slow.
Aggressive low-bit (2/3-bit) quantization alleviates the bandwidth bottleneck but causes prohibitive accuracy degradation.
Existing methods apply one uniform quantization policy throughout, ignoring the differing redundancy and approximation sensitivity of the prefill and decoding phases.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Phase-aware precision allocation: high precision during prefill for context extraction, lower precision during decoding for bandwidth efficiency
Progressive Mixed-Precision Decoding (PMPD): gradual precision lowering deeper in the generated sequence
Task-adaptive and prompt-adaptive precision-switching schedulers that dynamically drive precision-lowering decisions
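A prompt-adaptive scheduler must decide, at runtime, when the generation has become predictable enough to tolerate lower precision. One plausible policy — a hypothetical sketch, not the paper's actual scheduler — drops to the next lower bit-width after several consecutive low-entropy next-token distributions:

```python
import math

def next_token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

class PromptAdaptiveScheduler:
    """Hypothetical prompt-adaptive precision scheduler: descend the
    bit-width ladder after `patience` consecutive low-entropy steps."""

    def __init__(self, ladder=(8, 4, 3), threshold=1.0, patience=4):
        self.ladder = ladder          # decode-phase bit-widths, high to low
        self.threshold = threshold    # entropy (nats) below which a step is "calm"
        self.patience = patience      # calm steps required before switching
        self.level = 0
        self.calm_steps = 0

    def step(self, probs):
        """Observe one next-token distribution; return the bit-width to use."""
        if next_token_entropy(probs) < self.threshold:
            self.calm_steps += 1
        else:
            self.calm_steps = 0  # uncertainty resets the countdown
        if self.calm_steps >= self.patience and self.level < len(self.ladder) - 1:
            self.level += 1
            self.calm_steps = 0
        return self.ladder[self.level]
```

A task-adaptive variant could instead fix the switch points offline per task family; the entropy trigger here simply illustrates the prompt-adaptive end of the spectrum the paper describes.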