Systematic Characterization of LLM Quantization: A Performance, Energy, and Quality Perspective

📅 2025-08-22

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Problem: Existing studies lack a systematic understanding of the trade-offs among inference latency, energy consumption, and generation quality when quantizing large language models (LLMs) in real-world online serving environments. Method: We conduct one of the first cross-layer empirical evaluations, spanning the application, system, and hardware layers, of 11 post-training quantization methods across four model scales (7B-70B) on A100 and H100 GPUs. We introduce qMeter, a fully automated online characterization framework that covers task semantics, workload dynamics, parallelization strategies, and hardware interactions. Contribution/Results: Our analysis shows that quantization efficacy depends strongly on the task and the method and is sensitive to workload characteristics and GPU architecture, yielding empirically grounded multi-objective trade-off principles for deployment. Building on these findings, we present three optimization case studies (capacity planning, energy-efficient scheduling, and multi-objective tuning) that illustrate how to evaluate and co-optimize LLM quantization for production-grade serving.
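One way to read the multi-objective trade-off among latency, energy, and quality is as a Pareto filter over measured configurations. The sketch below is illustrative only and is not taken from the paper or from qMeter: the `Config` fields, the `pareto_front` helper, and every number in the usage note are hypothetical placeholders, not reported results.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Config:
    """One measured serving configuration (all fields are hypothetical)."""
    name: str
    latency_ms: float  # lower is better
    energy_j: float    # lower is better
    quality: float     # higher is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = (a.latency_ms <= b.latency_ms
                and a.energy_j <= b.energy_j
                and a.quality >= b.quality)
    strictly_better = (a.latency_ms < b.latency_ms
                       or a.energy_j < b.energy_j
                       or a.quality > b.quality)
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs)]
```

With placeholder inputs such as `Config("a", 10.0, 5.0, 0.90)`, `Config("b", 12.0, 6.0, 0.80)`, and `Config("c", 8.0, 7.0, 0.85)`, the filter keeps `a` and `c`: `b` is slower, costlier, and lower in quality than `a`, while `a` and `c` each win on at least one axis.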


📝 Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their heavy resource demands make quantization (reducing precision to lower-bit formats) critical for efficient serving. While many quantization methods exist, a systematic understanding of their performance, energy, and quality tradeoffs under realistic serving conditions is still missing. In this work, we first develop a fully automated online characterization framework, qMeter, and then conduct an in-depth characterization of 11 post-training LLM quantization methods across four model sizes (7B-70B) and two GPU architectures (A100, H100). We evaluate quantization at the application, workload, parallelism, and hardware levels under online serving conditions. Our study reveals highly task- and method-dependent tradeoffs, strong sensitivity to workload characteristics, and complex interactions with parallelism and GPU architecture. We further present three optimization case studies illustrating deployment challenges in capacity planning, energy-efficient scheduling, and multi-objective tuning. To the best of our knowledge, this is one of the first comprehensive application-, system-, and hardware-level characterizations of LLM quantization from a joint performance, energy, and quality perspective.
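For readers new to the term, the idea of reducing precision to a lower-bit format can be sketched with plain round-to-nearest symmetric quantization. This is a generic textbook baseline, not one of the 11 methods the paper evaluates; the function names and the single per-tensor scale are assumptions made for illustration.

```python
from typing import List, Tuple


def quantize_symmetric(weights: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Map floats to signed integer codes using one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    # Round to the nearest integer code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]
```

Each reconstructed value differs from the original by at most half a quantization step (`scale / 2`), which is where the quality side of the tradeoff comes from: fewer bits mean a larger step and a larger rounding error.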

Problem

Research questions and friction points this paper is trying to address.

Systematically evaluating LLM quantization tradeoffs under realistic serving conditions

Analyzing performance, energy, and quality impacts across different quantization methods

Investigating hardware and workload dependencies in quantized LLM deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed qMeter, a fully automated online characterization framework

Characterized 11 quantization methods systematically

Evaluated quantization at the application, workload, parallelism, and hardware levels

🔎 Similar Papers

An empirical study of LLaMA3 quantization: from LLMs to MLLMs

2024-04-22 | Vis. Intell. | Citations: 20

