throttLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving

📅 2024-08-05
🏛️ International Symposium on High-Performance Computer Architecture
📈 Citations: 5
Influential: 2
📄 PDF
🤖 AI Summary
To balance high GPU energy consumption against strict service-level objective (SLO) guarantees in large language model (LLM) inference serving, this paper proposes a fine-grained framework that jointly optimizes GPU frequency scaling and instance sizing, grounded in KV-cache modeling and batch-size trend prediction. Its key innovation is an iteration-level performance predictor, implemented as a lightweight ML model (R² > 0.97), that drives dynamic voltage and frequency scaling (DVFS) and elastic instance scaling, co-optimizing energy efficiency and latency under SLO constraints. Compared to NVIDIA's Triton server, the framework reduces GPU energy consumption by up to 43.8% while meeting SLOs, improves energy efficiency by at least 1.71×, and keeps average prediction error below 1 iteration/s.
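The summary's core loop, picking the lowest GPU frequency whose predicted throughput still satisfies the SLO, can be sketched as follows. This is a hypothetical illustration, not the paper's code: `predict_iters_per_sec` stands in for the paper's ML performance predictor (here a toy linear proxy), and all frequencies and coefficients are invented assumptions.

```python
# Hypothetical sketch of throttLL'eM-style SLO-aware frequency selection.
# All function names, coefficients, and numbers below are illustrative
# assumptions, not taken from the paper's implementation.

def predict_iters_per_sec(freq_mhz, batch_size, kv_cache_gb):
    """Stand-in for the paper's ML performance predictor: estimates
    inference throughput (iterations/s) at a given GPU frequency.
    A simple linear proxy is used purely for illustration."""
    return 0.02 * freq_mhz - 0.5 * batch_size - 0.3 * kv_cache_gb

def pick_min_frequency(freqs_mhz, batch_size, kv_cache_gb, slo_iters_per_sec):
    """Choose the lowest candidate frequency whose predicted throughput
    still meets the SLO; fall back to the maximum frequency otherwise."""
    for f in sorted(freqs_mhz):
        if predict_iters_per_sec(f, batch_size, kv_cache_gb) >= slo_iters_per_sec:
            return f
    return max(freqs_mhz)

freqs = [900, 1200, 1500, 1800]  # candidate GPU clocks (MHz), assumed
print(pick_min_frequency(freqs, batch_size=8, kv_cache_gb=20,
                         slo_iters_per_sec=10))  # → 1200
```

The design choice reflected here is the paper's inversion of reactive DVFS: rather than throttling after an SLO violation, the controller predicts throughput per candidate frequency ahead of time and settles on the most energy-frugal clock that is still safe.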

📝 Abstract
As Large Language Models (LLMs) gain traction, their reliance on power-hungry GPUs places ever-increasing energy demands, raising environmental and monetary concerns. Inference dominates LLM workloads, presenting a critical challenge for providers: minimizing energy costs under Service-Level Objectives (SLOs) that ensure optimal user experience. In this paper, we present throttLL’eM, a framework that reduces energy consumption while meeting SLOs through the use of instance and GPU frequency scaling. throttLL’eM features mechanisms that project future Key-Value (KV) cache usage and batch size. Leveraging a Machine-Learning (ML) model that receives these projections as inputs, throttLL’eM manages performance at the iteration level to satisfy SLOs with reduced frequencies and instance sizes. We show that the proposed ML model achieves R² scores greater than 0.97 and mispredicts performance by less than 1 iteration per second on average. Experimental results on LLM inference traces show that throttLL’eM achieves up to 43.8% lower energy consumption and an energy efficiency improvement of at least 1.71× under SLOs, when compared to NVIDIA’s Triton server. throttLL’eM is publicly available at https://github.com/WilliamBlaskowicz/throttLL-eM.
Problem

Research questions and friction points this paper is trying to address.

Reducing energy consumption of LLM inference under SLO constraints
Optimizing GPU frequency scaling for energy-efficient LLM serving
Managing performance via KV cache and batch size projections
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses instance and GPU frequency scaling for energy efficiency
Predicts future KV cache usage and batch size via ML model
Manages performance at iteration level to meet SLOs
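The projection idea in the bullets above can be sketched as a short routine that forecasts KV-cache footprint a fixed number of decode iterations ahead. Everything here is an illustrative assumption (the per-token byte count, the request tuple layout, and the one-token-per-iteration decode model), not the paper's actual mechanism or data structures.

```python
# Hypothetical sketch of projecting future KV-cache usage, in the spirit
# of throttLL'eM's projection mechanism; all sizes are assumptions.

KV_BYTES_PER_TOKEN = 800_000  # assumed per-token KV footprint (bytes)

def project_kv_usage(active_requests, horizon_iters):
    """Estimate total KV-cache bytes after `horizon_iters` more decode
    iterations, assuming each active request appends one token per
    iteration until it reaches its expected output length."""
    total_tokens = 0
    for current_len, expected_len in active_requests:
        projected = min(current_len + horizon_iters, expected_len)
        total_tokens += projected
    return total_tokens * KV_BYTES_PER_TOKEN

# Two in-flight requests: (tokens held so far, expected final length)
reqs = [(100, 400), (350, 360)]
print(project_kv_usage(reqs, horizon_iters=50))  # → 408000000 bytes
```

Such a projection, together with a batch-size trend estimate, is what lets an iteration-level controller act before memory pressure or throughput shortfalls materialize, instead of reacting after the fact.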