Where Do the Joules Go? Diagnosing Inference Energy Consumption

📅 2026-01-29
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This study addresses the lack of systematic diagnosis of energy consumption disparities in generative AI inference, which hinders effective energy-efficiency optimization. Through a large-scale empirical analysis measuring 1,858 configurations across 46 models and 7 task types on H100 and B200 GPUs, the work proposes the first cross-layer energy diagnostic framework tailored for generative AI inference. It attributes inference latency and energy consumption to latent factors across the algorithmic, software, and hardware layers, such as memory access patterns and GPU utilization. Key findings reveal that task type alone can induce up to 25× energy differences, video generation consumes over 100× more energy than image generation, and variations in GPU utilization cause 3–5× energy fluctuations. These insights provide quantitative foundations and actionable pathways for designing energy-efficient generative AI systems.

πŸ“ Abstract
Energy is now a critical ML computing resource. While measuring energy consumption and observing trends is a valuable first step, accurately understanding and diagnosing why those differences occur is crucial for optimization. To that end, we begin by presenting a large-scale measurement study of inference time and energy across the generative AI landscape with 46 models, 7 tasks, and 1,858 different configurations on NVIDIA H100 and B200 GPUs. Our empirical findings span order-of-magnitude variations: LLM task type can lead to 25× energy differences, video generation sometimes consumes more than 100× the energy of images, and GPU utilization differences can result in 3–5× energy differences. Based on our observations, we present a framework for reasoning about the underlying mechanisms that govern time and energy consumption. The essence is that time and energy are determined by latent metrics like memory and utilization, which are in turn affected by various factors across the algorithm, software, and hardware layers. Our framework also extends directly to throughput per watt, a critical metric for power-constrained datacenters.
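To make the abstract's metrics concrete, here is a minimal sketch of how sampled GPU power turns into per-request energy and an energy-normalized throughput figure. This is not the paper's measurement code: it assumes NVML-style power samples in milliwatts (e.g., what `nvmlDeviceGetPowerUsage` reports), and the helper names and the hard-coded trace are illustrative only.

```python
# Sketch: integrate sampled GPU power into energy, then compute an
# energy-normalized throughput. Power samples are placeholders here; on
# real hardware they would come from a sampler such as NVML's
# nvmlDeviceGetPowerUsage, which reports milliwatts.

def energy_joules(power_mw, interval_s):
    """Trapezoidal integration of evenly spaced power samples (mW) -> joules."""
    watts = [mw / 1000.0 for mw in power_mw]
    return sum((a + b) / 2.0 * interval_s for a, b in zip(watts, watts[1:]))

def tokens_per_joule(tokens, power_mw, interval_s):
    """Tokens generated per joule consumed, the integral analogue of
    throughput per watt."""
    return tokens / energy_joules(power_mw, interval_s)

# Hypothetical trace: 400 W held for 0.2 s while generating 160 tokens.
samples = [400_000, 400_000, 400_000]       # milliwatts, one sample per 0.1 s
print(energy_joules(samples, 0.1))          # 80.0 (joules)
print(tokens_per_joule(160, samples, 0.1))  # 2.0 (tokens per joule)
```

At a fixed latency, higher tokens-per-joule directly translates into higher throughput per watt, which is why the paper treats the two views as interchangeable for power-constrained datacenters.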
Problem

Research questions and friction points this paper is trying to address.

energy consumption
inference
generative AI
GPU utilization
power efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

energy consumption
inference optimization
generative AI
GPU utilization
throughput per watt