The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures

📅 2026-05-12

🤖 AI Summary
This work examines why power capping, the standard GPU energy lever in LLM inference serving, has little real effect during autoregressive decoding. Using a phase-aware energy profiling methodology, the authors show that decode is memory-bound: it saturates HBM bandwidth rather than compute, so power draw stays well below the GPU's limit and the cap never triggers. As an alternative, they lock the SM clock, which reduces decode energy by up to 32% at minimal throughput loss. Comparing GQA, MLA, Gated DeltaNet, and Mamba2, the study identifies three architecture-dependent DVFS behavioural classes and a common energy pattern: the newer architectures pay a heavy prefill cost but decode efficiently, eventually halving total request energy relative to GQA at production batch sizes.
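As a rough illustration of what phase-aware energy accounting involves, the sketch below integrates sampled GPU power over time and charges each interval to the phase that was active. It is not the authors' tooling: the sample format, the phase labels, and the trace values are all hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from typing import Dict, List, Tuple

# One sample: (timestamp in seconds, power in watts, phase label).
# The labels "prefill" and "decode" are illustrative assumptions.
Sample = Tuple[float, float, str]


def energy_by_phase(samples: List[Sample]) -> Dict[str, float]:
    """Trapezoidal integration of power over time, split by phase.

    Each interval between consecutive samples is charged to the phase
    recorded on the earlier of the two samples.
    """
    energy: Dict[str, float] = {}
    for (t0, p0, phase), (t1, p1, _) in zip(samples, samples[1:]):
        joules = 0.5 * (p0 + p1) * (t1 - t0)
        energy[phase] = energy.get(phase, 0.0) + joules
    return energy


# Hypothetical trace: a short, power-hungry prefill followed by a
# longer, low-power decode.
trace: List[Sample] = [
    (0.0, 600.0, "prefill"),
    (0.5, 600.0, "prefill"),
    (1.0, 200.0, "decode"),
    (2.0, 200.0, "decode"),
    (3.0, 200.0, "decode"),
]

result = energy_by_phase(trace)
print(result)
```

Splitting energy this way is what makes it possible to compare a heavy prefill against an efficient decode for the same request, rather than reporting a single blended figure.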
📝 Abstract
Power capping is the standard GPU energy lever in LLM serving, and it appears to work: throughput drops, power readings fall, and energy budgets are met. We show the appearance is illusory for the phase that dominates production serving: autoregressive decode. Across four attention paradigms (GQA, MLA, Gated DeltaNet, and Mamba2) on NVIDIA H200, decode draws only 137–300 W on a 700 W GPU; no cap ever triggers, because memory-bound decode saturates HBM bandwidth rather than compute and leaves power headroom untouched. Firmware-initiated clock throttling compounds the illusion: these deviations can corrupt any throughput measurement that attributes them to the cap. SM clock locking dissolves both confounds. By targeting the lever that is actually on the critical path, clock locking Pareto-dominates power capping universally, recovering up to 32% of decode energy at minimal throughput loss. We identify three architecture-dependent DVFS behavioural classes and characterise a common energy pattern across novel attention replacements: a heavy prefill cost recouped by efficient decode, eventually halving total request energy relative to GQA at production batch sizes.
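The claim that no cap triggers can be read as a simple comparison: a power cap only constrains the GPU when the workload's draw would otherwise reach it. The sketch below encodes that check under stated assumptions. The 137 W and 300 W endpoints come from the abstract, while the middle sample and the list of candidate caps are hypothetical. On NVIDIA tooling the two levers being contrasted correspond to `nvidia-smi -pl` (power limit) and `nvidia-smi -lgc` (lock GPU clocks); the code models only the first.

```python
from typing import Iterable, List


def cap_is_binding(power_w: Iterable[float], cap_w: float) -> bool:
    """Return True if any observed power sample reaches the cap."""
    return any(p >= cap_w for p in power_w)


def binding_caps(power_w: List[float], caps_w: List[float]) -> List[float]:
    """Return the caps from a sweep that would constrain this workload."""
    return [c for c in caps_w if cap_is_binding(power_w, c)]


# Decode power draw in watts. The 137 and 300 endpoints are from the
# abstract; the 210 sample is an illustrative assumption.
decode_power_w = [137.0, 210.0, 300.0]

# Hypothetical cap sweep below the 700 W limit quoted in the abstract.
candidate_caps_w = [700.0, 600.0, 500.0, 400.0]

active = binding_caps(decode_power_w, candidate_caps_w)
print(active)
```

In this simplified model a cap binds only if it is set at or below the peak draw, so every cap in the sweep leaves decode untouched. Any throughput change observed under such a cap would need another explanation, such as the firmware-initiated clock throttling the abstract describes.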
Problem

Research questions and friction points this paper is trying to address.

power capping
LLM decode
energy characterization
attention architectures
memory-bound
Innovation

Methods, ideas, or system contributions that make the work stand out.

power capping
SM clock locking
energy characterization
attention architectures
LLM decode