🤖 AI Summary
This study addresses the significant yet underexplored impact of system-level design on energy consumption in large language model (LLM) inference. Through an empirical evaluation on an NVIDIA H100 platform, we systematically analyze how numerical precision, batching strategy, and request scheduling (implemented via quantization, dynamic batching, and Hugging Face's Text Generation Inference server, TGI) affect inference energy efficiency and latency. Our findings reveal that low-precision computation reduces energy only during compute-intensive phases, that dynamic batching substantially improves decoding energy efficiency, and that structured request scheduling can reduce per-request energy consumption by up to two orders of magnitude. Building on these insights, we propose a phase-aware energy efficiency analysis framework that offers both theoretical grounding and practical guidance for optimizing LLM inference systems.
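As a rough illustration of what phase-aware energy profiling can look like in practice, the sketch below reads NVML's cumulative energy counter around each inference phase. It is not the authors' measurement code; it assumes the pynvml package, a single NVIDIA GPU at index 0, and driver support for the total-energy counter (available on H100-class hardware).

```python
# Minimal sketch (assumption: not the paper's code): phase-aware GPU energy
# measurement using NVML's cumulative energy counter on an H100-class GPU.
import time
from contextlib import contextmanager

import pynvml

pynvml.nvmlInit()
_handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes GPU index 0


@contextmanager
def gpu_energy_phase(label: str):
    """Report the energy (J) and wall time (s) consumed by a labelled phase."""
    e0 = pynvml.nvmlDeviceGetTotalEnergyConsumption(_handle)  # millijoules
    t0 = time.perf_counter()
    yield
    t1 = time.perf_counter()
    e1 = pynvml.nvmlDeviceGetTotalEnergyConsumption(_handle)
    print(f"{label}: {(e1 - e0) / 1000:.1f} J over {t1 - t0:.2f} s")


# Usage: wrap the compute-bound prefill and the memory-bound decode separately,
# e.g. the prompt forward pass in one phase and the token-by-token loop in another:
# with gpu_energy_phase("prefill"):
#     ...
# with gpu_energy_phase("decode"):
#     ...
```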
📝 Abstract
Large Language Models (LLMs) are increasingly deployed in production, shifting the bulk of computational and energy demands from training to inference. While prior work has examined the energy cost of inference per prompt or per token, we highlight how system-level design choices such as numerical precision, batching strategy, and request scheduling can lead to orders-of-magnitude differences in energy consumption for the same model. We perform a detailed empirical study of LLM inference energy and latency on NVIDIA H100 GPUs, analyzing the impact of quantization, batch size, and serving configuration (e.g., with Hugging Face's Text Generation Inference server). Our results reveal that lower-precision formats yield energy gains only in compute-bound regimes; that batching improves energy efficiency, especially in memory-bound phases such as decoding; and that structured request timing (arrival shaping) can reduce per-request energy by a factor of up to 100. We argue that sustainable LLM deployment depends not only on model internals but also on the orchestration of the serving stack. Our findings motivate phase-aware energy profiling and system-level optimizations for greener AI services.
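To make the arrival-shaping idea concrete, here is a hedged sketch (not taken from the paper) that sends the same prompts to a running TGI server either staggered in time or as a synchronized burst; the endpoint URL, prompt set, and timing gap are illustrative assumptions, and the server is assumed to be up and exposing its standard /generate JSON API.

```python
# Illustrative sketch (assumptions: a TGI server already running at BASE_URL
# with the standard /generate JSON API; URL and prompts are hypothetical).
# Compares staggered vs. burst arrivals, so the server's continuous batching
# can group concurrent requests into larger decode batches.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://localhost:8080"  # hypothetical local TGI endpoint
PROMPTS = [f"Summarize item {i} in one sentence." for i in range(16)]


def generate(prompt: str) -> str:
    resp = requests.post(
        f"{BASE_URL}/generate",
        json={"inputs": prompt, "parameters": {"max_new_tokens": 64}},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]


def staggered(gap_s: float = 1.0) -> None:
    # Requests trickle in one by one: the server sees small batches.
    for p in PROMPTS:
        generate(p)
        time.sleep(gap_s)


def burst() -> None:
    # All requests arrive together: the server can serve them in one batched window.
    with ThreadPoolExecutor(max_workers=len(PROMPTS)) as pool:
        list(pool.map(generate, PROMPTS))
```

Pairing either arrival pattern with an energy counter like the one sketched under the summary gives per-request joules for the two schedules, which is the kind of comparison the abstract's arrival-shaping result refers to.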