Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures

📅 2026-04-10

🤖 AI Summary
This work addresses the lack of energy-aware benchmarks for large language model (LLM) inference across heterogeneous GPU architectures, a gap that hinders energy-efficient deployment. We present the largest open-access dataset of LLM energy consumption, comprising over 5,000 inference runs of 50 LLMs across 10 NVIDIA GPU models in both batch and server scenarios, together with a reproducible, open-source benchmark that accepts community submissions to extend the dataset. Our study systematically quantifies energy differences in LLM inference across GPUs and shows that hardware selection is critical to energy efficiency, with the optimal GPU varying significantly by model and deployment scenario. Hardware-aware deployment guided by this data reduces energy consumption by up to 70% in server scenarios with negligible impact on user experience, and by up to 20% in batch scenarios, supporting a greener, hardware-conscious approach to LLM deployment.
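The summary does not say how the energy of each run is measured. As a minimal sketch, assuming a run is recorded as timestamped GPU power samples (on NVIDIA hardware such readings could come from NVML's `nvmlDeviceGetPowerUsage`, which reports milliwatts), energy in joules is the time integral of power, and dividing by the number of generated tokens gives a per-token figure that can be compared across GPUs. The `PowerSample` type and both helper functions are hypothetical and are not part of the Watt Counts benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerSample:
    t_s: float    # timestamp in seconds
    watts: float  # instantaneous GPU board power in watts


def energy_joules(samples: list[PowerSample]) -> float:
    """Integrate power over time with the trapezoidal rule (W x s = J)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to integrate")
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        dt = cur.t_s - prev.t_s
        if dt < 0:
            raise ValueError("samples must be sorted by timestamp")
        # Average power over the interval multiplied by its duration.
        total += 0.5 * (prev.watts + cur.watts) * dt
    return total


def joules_per_token(samples: list[PowerSample], output_tokens: int) -> float:
    """Energy of a run divided by the number of tokens it generated."""
    if output_tokens <= 0:
        raise ValueError("output_tokens must be positive")
    return energy_joules(samples) / output_tokens
```

For example, illustrative samples of 300 W, 320 W, and 310 W taken one second apart integrate to 625 J; if that run produced 250 output tokens, it used 2.5 J per token. The trapezoidal rule works with uneven sampling intervals because it uses the actual timestamps, but sparse sampling can still miss short power spikes.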

📝 Abstract
While the large energy consumption of Large Language Models (LLMs) is recognized by the community, system operators lack guidance for energy-efficient LLM inference deployments that leverage energy trade-offs of heterogeneous hardware due to a lack of energy-aware benchmarks and data. In this work we address this gap with Watt Counts: the largest open-access dataset of energy consumption of LLMs, with over 5,000 experiments for 50 LLMs across 10 NVIDIA Graphics Processing Units (GPUs) in batch and server scenarios along with a reproducible, open-source benchmark that enables community submissions to expand this dataset. Leveraging this dataset, we conduct a system-level study of LLM inference across heterogeneous GPU architectures and show that GPU selection is crucial for energy efficiency outcomes and that optimal hardware choices vary significantly across models and deployment scenarios, demonstrating the critical importance of hardware-aware deployment in heterogeneous LLM systems. Guided by our data and insights, we show that practitioners can reduce energy consumption by up to 70% in server scenarios with negligible impact on user experience, and by up to 20% in batch scenarios.
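The abstract's central finding is that the energy-optimal GPU depends on the model and the deployment scenario. A minimal sketch of hardware-aware selection, assuming per-GPU measurements of energy per request and 95th-percentile latency for one model in one scenario, is to discard GPUs that miss a latency target (used here as a stand-in for "negligible impact on user experience") and pick the most energy-efficient of the rest. The `Measurement` type, `pick_gpu`, `percent_saving`, and the GPU labels are hypothetical, and any numbers passed to them are illustrative rather than results from the dataset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Measurement:
    gpu: str                   # hypothetical GPU label
    joules_per_request: float  # mean energy per request (J)
    p95_latency_s: float       # 95th-percentile request latency (s)


def pick_gpu(measurements: list[Measurement], latency_slo_s: float) -> Optional[Measurement]:
    """Return the lowest-energy GPU whose p95 latency meets the target, or None."""
    feasible = [m for m in measurements if m.p95_latency_s <= latency_slo_s]
    if not feasible:
        return None
    return min(feasible, key=lambda m: m.joules_per_request)


def percent_saving(baseline_j: float, chosen_j: float) -> float:
    """Energy saved by the chosen GPU relative to a baseline, in percent."""
    if baseline_j <= 0:
        raise ValueError("baseline energy must be positive")
    return 100.0 * (baseline_j - chosen_j) / baseline_j
```

With illustrative measurements of 400 J at 0.8 s for `gpu_a`, 150 J at 1.1 s for `gpu_b`, and 120 J at 2.5 s for `gpu_c`, a 1.5 s target excludes `gpu_c` and selects `gpu_b`, and `percent_saving(400.0, 150.0)` returns 62.5 relative to `gpu_a`. Because the optimum shifts across models and scenarios, the selection would be repeated for each (model, scenario) pair rather than fixed once for a whole cluster.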

Problem

Research questions and friction points this paper is trying to address.

energy-aware benchmark
LLM inference
heterogeneous GPU architectures
energy consumption
sustainable AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

energy-aware benchmark
heterogeneous GPU architectures
LLM inference
sustainable AI
hardware-aware deployment