🤖 AI Summary
Problem: Existing studies lack a systematic understanding of the trade-offs among inference latency, energy consumption, and generation quality when large language models (LLMs) are quantized for real-world online serving. Method: The authors conduct one of the first cross-layer empirical evaluations, spanning the application, system, and hardware layers, of 11 post-training quantization methods across four model sizes (7B to 70B) on A100 and H100 GPUs. They introduce qMeter, an automated online profiling framework that characterizes task semantics, workload dynamics, parallelization strategies, and hardware interactions. Contribution/Results: The analysis shows that quantization efficacy depends strongly on the task and the method and is sensitive to workload characteristics and GPU architecture, yielding empirically grounded multi-objective trade-off principles for deployment. Building on these findings, the authors present three optimization case studies (capacity planning, energy-aware scheduling, and multi-objective tuning) that address a gap in the holistic evaluation and co-optimization of LLM quantization for production-grade serving.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across diverse domains, but their heavy resource demands make quantization (reducing numerical precision to lower-bit formats) critical for efficient serving. While many quantization methods exist, a systematic understanding of their performance, energy, and quality tradeoffs under realistic serving conditions is still missing. In this work, we first develop a fully automated online characterization framework, qMeter, and then conduct an in-depth characterization of 11 post-training LLM quantization methods across four model sizes (7B to 70B) and two GPU architectures (A100 and H100). We evaluate quantization at the application, workload, parallelism, and hardware levels under online serving conditions. Our study reveals highly task- and method-dependent tradeoffs, strong sensitivity to workload characteristics, and complex interactions with parallelism and GPU architecture. We further present three optimization case studies illustrating deployment challenges in capacity planning, energy-efficient scheduling, and multi-objective tuning. To the best of our knowledge, this is one of the first comprehensive application-, system-, and hardware-level characterizations of LLM quantization from a joint performance, energy, and quality perspective.
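As an illustration of what a joint performance, energy, and quality view implies for multi-objective tuning, the minimal sketch below filters candidate configurations down to a Pareto front. It is not qMeter and not the paper's procedure: the `Config` type, the `dominates` and `pareto_front` helpers, the method names, and every number are hypothetical placeholders chosen only to make the example runnable.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """One hypothetical serving configuration (all fields are placeholders)."""
    name: str
    latency_ms: float  # lower is better
    energy_j: float    # lower is better
    quality: float     # higher is better


def dominates(a: Config, b: Config) -> bool:
    """True if `a` is no worse than `b` on every objective and better on one."""
    no_worse = (
        a.latency_ms <= b.latency_ms
        and a.energy_j <= b.energy_j
        and a.quality >= b.quality
    )
    strictly_better = (
        a.latency_ms < b.latency_ms
        or a.energy_j < b.energy_j
        or a.quality > b.quality
    )
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up values for illustration only; these are NOT measurements from the paper.
candidates = [
    Config("fp16-baseline", 100.0, 50.0, 0.80),
    Config("method-a", 70.0, 35.0, 0.78),
    Config("method-b", 60.0, 30.0, 0.70),
    Config("method-c", 75.0, 40.0, 0.75),
]

front = pareto_front(candidates)
print([c.name for c in front])
```

With these placeholder values, `method-c` is dropped because `method-a` is at least as good on all three objectives, while the remaining three each win on at least one objective. Choosing among the survivors still requires a deployment-specific preference, such as a latency target or a minimum acceptable quality.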