Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the absence of energy-aware benchmarks for large language model (LLM) inference across heterogeneous GPU architectures, a gap that hinders energy-efficient deployment. The authors present Watt Counts, which they describe as the largest open-access dataset of LLM energy consumption: over 5,000 inference experiments covering 50 LLMs on 10 NVIDIA GPUs in both batch and server scenarios, accompanied by a reproducible, open-source benchmark that accepts community submissions. A system-level study built on this dataset shows that GPU selection strongly affects energy efficiency and that the best hardware choice varies across models and deployment scenarios. According to the reported results, hardware-aware deployment can reduce energy consumption by up to 70% in server scenarios, with negligible impact on user experience, and by up to 20% in batch scenarios.

📝 Abstract

Although the community recognizes the large energy consumption of Large Language Models (LLMs), system operators lack guidance on energy-efficient LLM inference deployments that exploit the energy trade-offs of heterogeneous hardware, because energy-aware benchmarks and data are missing. In this work we address this gap with Watt Counts: the largest open-access dataset of energy consumption of LLMs, with over 5,000 experiments for 50 LLMs across 10 NVIDIA Graphics Processing Units (GPUs) in batch and server scenarios, along with a reproducible, open-source benchmark that enables community submissions to expand the dataset. Leveraging this dataset, we conduct a system-level study of LLM inference across heterogeneous GPU architectures. We show that GPU selection is crucial for energy efficiency and that optimal hardware choices vary significantly across models and deployment scenarios, which makes hardware-aware deployment essential in heterogeneous LLM systems. Guided by our data and insights, we show that practitioners can reduce energy consumption by up to 70% in server scenarios with negligible impact on user experience, and by up to 20% in batch scenarios.
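To make the idea of hardware-aware deployment concrete, here is a minimal Python sketch of how per-GPU energy measurements could drive a GPU choice. It is not the Watt Counts benchmark or its API. The `Measurement` record, the energy-per-token metric, the latency budget, and every number in `SAMPLE_RUNS` are hypothetical placeholders chosen for illustration, not values from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """One hypothetical benchmark run of a single model on a single GPU."""
    gpu: str
    energy_joules: float   # total energy drawn during the run
    output_tokens: int     # tokens generated during the run
    p95_latency_s: float   # assumed stand-in for user-perceived latency


def joules_per_token(m: Measurement) -> float:
    """Normalize energy by work done so runs on different GPUs are comparable."""
    return m.energy_joules / m.output_tokens


def pick_gpu(runs: list[Measurement], latency_budget_s: float) -> Measurement:
    """Return the most energy-efficient run that still meets the latency budget."""
    feasible = [m for m in runs if m.p95_latency_s <= latency_budget_s]
    if not feasible:
        raise ValueError("no GPU meets the latency budget")
    return min(feasible, key=joules_per_token)


def relative_saving(baseline: Measurement, chosen: Measurement) -> float:
    """Fraction of energy per token saved by `chosen` relative to `baseline`."""
    return 1.0 - joules_per_token(chosen) / joules_per_token(baseline)


# Placeholder data only: these are NOT measurements from the paper.
SAMPLE_RUNS = [
    Measurement("gpu-a", energy_joules=1000.0, output_tokens=500, p95_latency_s=1.0),
    Measurement("gpu-b", energy_joules=500.0, output_tokens=500, p95_latency_s=1.2),
    Measurement("gpu-c", energy_joules=200.0, output_tokens=500, p95_latency_s=3.0),
]

if __name__ == "__main__":
    best = pick_gpu(SAMPLE_RUNS, latency_budget_s=1.5)
    saving = relative_saving(SAMPLE_RUNS[0], best)
    print(f"{best.gpu}: {joules_per_token(best):.2f} J/token, saving {saving:.0%}")
```

In this sketch the most efficient device (`gpu-c`) is rejected because it misses the latency budget, which mirrors the abstract's point that savings should come with negligible impact on user experience. A real selection would repeat this per model and per scenario, since the abstract reports that the optimal GPU varies across both.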

Problem

Research questions and friction points this paper is trying to address.

energy-aware benchmark

LLM inference

heterogeneous GPU architectures

energy consumption

sustainable AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

energy-aware benchmark

heterogeneous GPU architectures

LLM inference