🤖 AI Summary
The precision-efficiency trade-offs of quantization formats in LLM inference remain poorly characterized. This work conducts a large-scale empirical study across the full Llama-3.1 model family, systematically evaluating FP8, INT8, and INT4 weight/activation joint quantization on academic benchmarks and real-world tasks. Our methodology integrates vLLM-based cross-GPU performance profiling, evaluation on >500K samples, generation consistency analysis, and quantization-aware tuning. Key findings include: (i) W8A8-FP achieves lossless quantization across all model scales; (ii) W8A8-INT incurs only 1–3% accuracy degradation after lightweight tuning; and (iii) W4A16 matches or exceeds 8-bit alternatives across multiple scenarios. Based on these results, we propose a deployment-aware quantization format selection guideline—distinguishing high-throughput asynchronous versus cost-sensitive synchronous inference—and establish FP8 as the precision reference, with W8A8-INT and W4A16-INT as state-of-the-art balanced solutions for their respective deployment modes.
📝 Abstract
Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks, on the entire Llama-3.1 model family. Additionally, our study examines the difference in text generated by quantized models versus their uncompressed counterparts. Beyond benchmarks, we also present a couple of quantization improvements that allowed us to obtain state-of-the-art accuracy recovery results. Our investigation, encompassing over 500,000 individual evaluations, yields several key findings: (1) FP8 weight and activation quantization (W8A8-FP) is lossless across all model scales, (2) INT8 weight and activation quantization (W8A8-INT), when properly tuned, incurs surprisingly low 1-3% accuracy degradation, and (3) INT4 weight-only quantization (W4A16-INT) is competitive with 8-bit integer weight and activation quantization. To address the question of the "best" format for a given deployment environment, we conduct inference performance analysis using the popular open-source vLLM framework on various GPU architectures. We find that W4A16 offers the best cost-efficiency for synchronous deployments, and for asynchronous deployment on mid-tier GPUs. At the same time, W8A8 formats excel in asynchronous "continuous batching" deployment of mid- and large-size models on high-end GPUs. Our results provide a set of practical guidelines for deploying quantized LLMs across scales and performance requirements.
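For readers unfamiliar with the formats being compared, the following is a minimal sketch of what the "INT8" in W8A8-INT refers to: symmetric per-tensor quantization of a weight matrix to 8-bit integers, plus the dequantization that recovers an approximation. This is a generic illustration in NumPy, not the paper's actual method (which involves activation quantization and tuning as well); the function names are ours.

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: map floats into [-127, 127]."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map INT8 codes back to float32; the result approximates the original."""
    return q.astype(np.float32) * scale

# Round-trip a random weight matrix and check the worst-case error.
w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Rounding error is bounded by half a quantization step.
max_err = float(np.abs(w - w_hat).max())
assert max_err <= scale / 2 + 1e-6
```

Weight-only W4A16 follows the same idea with 4-bit codes (a 15-level symmetric grid, typically per-group rather than per-tensor), which is why its accuracy hinges more heavily on calibration and grouping choices.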