🤖 AI Summary
This work addresses a limitation of existing data center power models: they fail to capture the rapid transitions of GPUs among prefill, decode, and idle states during large language model (LLM) inference, and the impact of multi-device synchronization on facility-level power demand. The authors propose a compositional modeling approach that decomposes LLM inference power consumption into workload-driven state transitions and configuration-dependent intra-state power profiles, yielding a learnable, multi-scale power-trace generation framework. By integrating finite-state-machine modeling, configuration-aware power fitting, and spatiotemporal aggregation, the method supports high-fidelity power-trace synthesis from individual GPU servers up to entire data centers. Evaluated across diverse LLMs, tensor-parallelism configurations, and GPU platforms, it achieves median absolute energy errors below 5%, supporting infrastructure-planning tasks such as over-provisioning analysis, power modulation, and grid-side load assessment.
📝 Abstract
Datacenter operators and electrical utilities rely on power traces at different spatiotemporal scales. Operators use fine-grained traces for provisioning, facility management, and scheduling, while utilities use site-level load profiles for capacity and interconnection planning. Existing datacenter power models do not capture LLM inference workloads, in which GPUs shift rapidly among compute-intensive prefill, lower-power decode, and idle states, and facility demand depends on how these states evolve and synchronize across many devices. We show that LLM inference power can be represented compositionally through two components: workload-driven transitions among operating states and configuration-specific power distributions within those states. Building on this observation, we develop a trace-generation framework that learns from measured traces and synthesizes power profiles for new traffic conditions and serving configurations. These traces aggregate from GPU servers to rack-, row-, and facility-scale load profiles at the temporal granularity required by the study.
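The compositional decomposition described above can be illustrated with a toy sketch. All state names mirror the abstract (prefill, decode, idle), but the transition probabilities and per-state power distributions below are hypothetical placeholders, not the paper's learned parameters: a simple Markov-style finite-state machine drives transitions among operating states, each state draws power from its own configuration-dependent distribution, and summing per-GPU traces yields a rack-level profile.

```python
import random

# Hypothetical FSM transition probabilities between operating states
# (illustrative values only; the paper learns these from measured traces).
TRANSITIONS = {
    "idle":    {"idle": 0.7, "prefill": 0.3, "decode": 0.0},
    "prefill": {"idle": 0.0, "prefill": 0.2, "decode": 0.8},
    "decode":  {"idle": 0.1, "prefill": 0.1, "decode": 0.8},
}

# Hypothetical (mean, stddev) of per-GPU power in each state, in watts.
POWER_W = {
    "idle":    (75.0, 5.0),
    "prefill": (650.0, 40.0),
    "decode":  (380.0, 30.0),
}

def next_state(state, rng):
    """Sample the next operating state from the FSM row for `state`."""
    r, acc = rng.random(), 0.0
    for nxt, p in TRANSITIONS[state].items():
        acc += p
        if r < acc:
            return nxt
    return state  # numerical guard

def gpu_trace(n_steps, rng, state="idle"):
    """One GPU's trace: the FSM drives state transitions, and each
    state draws power from its own (Gaussian, here) profile."""
    trace = []
    for _ in range(n_steps):
        state = next_state(state, rng)
        mu, sigma = POWER_W[state]
        trace.append(max(0.0, rng.gauss(mu, sigma)))
    return trace

def rack_trace(n_gpus, n_steps, seed=0):
    """Aggregate independent per-GPU traces into a rack-level profile."""
    rng = random.Random(seed)
    traces = [gpu_trace(n_steps, rng) for _ in range(n_gpus)]
    return [sum(t[i] for t in traces) for i in range(n_steps)]
```

In the actual framework the transition dynamics are learned from measured traces and conditioned on traffic, and the intra-state power distributions depend on model, tensor-parallel degree, and GPU generation; this sketch only shows how the two components compose.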
Across multiple LLMs, tensor-parallel settings, and GPU generations, our framework achieves median absolute energy error below 5% for most configurations while preserving temporal autocorrelation structure. The resulting traces support downstream analyses including oversubscription, power modulation, and utility-facing load characterization, enabling infrastructure evaluations that flat nameplate assumptions and static trace replay cannot support.
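The headline metric can be made concrete. A minimal sketch of the absolute energy error between a synthesized and a measured power trace, assuming both are sampled at a fixed interval (function names are ours, not the paper's):

```python
def energy_wh(trace_w, dt_s):
    """Integrate a power trace (watts, fixed sampling interval dt_s seconds)
    to energy in watt-hours."""
    return sum(trace_w) * dt_s / 3600.0

def abs_energy_error(synth_w, measured_w, dt_s=1.0):
    """Relative absolute energy error of a synthesized trace
    against a measured reference trace."""
    e_syn = energy_wh(synth_w, dt_s)
    e_meas = energy_wh(measured_w, dt_s)
    return abs(e_syn - e_meas) / e_meas
```

Under this definition, a synthesized trace that consistently reads 5% high against the reference yields an error of 0.05, matching the sub-5% threshold the authors report for most configurations.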