🤖 AI Summary
To reduce the high GPU energy consumption of large language model (LLM) inference serving without violating service-level objectives (SLOs), this paper proposes throttLL’eM, a framework that jointly scales GPU frequency and instance size. It projects future KV-cache usage and batch size, and feeds these projections to a lightweight ML performance model (R² > 0.97, average prediction error below 1 iteration per second). The model's iteration-level predictions drive dynamic voltage and frequency scaling (DVFS) and instance scaling so that SLOs are met at reduced frequencies and instance sizes. Compared to NVIDIA's Triton server, the framework lowers energy consumption by up to 43.8% and improves energy efficiency by at least 1.71× while meeting SLOs.
📝 Abstract
As Large Language Models (LLMs) gain traction, their reliance on power-hungry GPUs creates ever-increasing energy demands, raising environmental and monetary concerns. Inference dominates LLM workloads, presenting a critical challenge for providers: minimizing energy costs under Service-Level Objectives (SLOs) that ensure optimal user experience. In this paper, we present throttLL’eM, a framework that reduces energy consumption while meeting SLOs through the use of instance and GPU frequency scaling. throttLL’eM features mechanisms that project future Key-Value (KV) cache usage and batch size. Leveraging a Machine-Learning (ML) model that receives these projections as inputs, throttLL’eM manages performance at the iteration level to satisfy SLOs with reduced frequencies and instance sizes. We show that the proposed ML model achieves $R^{2}$ scores greater than 0.97 and mispredicts performance by less than 1 iteration per second on average. Experimental results on LLM inference traces show that throttLL’eM achieves up to $\mathbf{43.8\%}$ lower energy consumption and an energy efficiency improvement of at least $1.71\times$ under SLOs, when compared to NVIDIA’s Triton server. throttLL’eM is publicly available at https://github.com/WilliamBlaskowicz/throttLL-eM.
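The frequency-selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the candidate frequency list, and the toy predictor with its coefficients are all assumptions made for illustration. The sketch only shows the general pattern of choosing the lowest GPU frequency whose predicted iterations per second still meets the rate an SLO deadline requires, given projected batch size and KV-cache usage.

```python
from typing import Callable, Sequence


def toy_predict_ips(freq_mhz: float, batch_size: int, kv_blocks: int) -> float:
    """Hypothetical stand-in for the learned performance model.

    Returns predicted iterations per second. The coefficients are invented:
    throughput rises with frequency and falls with batch size and KV usage.
    """
    return 0.03 * freq_mhz / (1.0 + 0.02 * batch_size + 0.0005 * kv_blocks)


def required_ips(remaining_tokens: int, seconds_to_deadline: float) -> float:
    """Iterations per second needed to emit the remaining tokens in time
    (one decoding iteration produces one token per request)."""
    return remaining_tokens / seconds_to_deadline


def lowest_safe_frequency(
    freqs_mhz: Sequence[float],
    batch_size: int,
    kv_blocks: int,
    target_ips: float,
    predict: Callable[[float, int, int], float] = toy_predict_ips,
) -> float:
    """Pick the lowest candidate frequency predicted to meet target_ips.

    Falls back to the highest frequency if no candidate is predicted to
    satisfy the target.
    """
    for freq in sorted(freqs_mhz):
        if predict(freq, batch_size, kv_blocks) >= target_ips:
            return freq
    return max(freqs_mhz)


# Example: 200 tokens left, 10 s until the deadline, projected batch of 8
# requests occupying 400 KV-cache blocks (all values are illustrative).
candidates = [600, 900, 1200, 1500]
target = required_ips(remaining_tokens=200, seconds_to_deadline=10.0)
chosen = lowest_safe_frequency(candidates, batch_size=8, kv_blocks=400, target_ips=target)
print(chosen)
```

In a real system the predictor would be the trained ML model and the chosen value would be applied through the GPU's clock-control interface; both are outside the scope of this sketch.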