Position: LLM Inference Should Be Evaluated as Energy-to-Token Production

📅 2026-05-12

🤖 AI Summary

Current evaluations of large language model (LLM) inference emphasize accuracy and latency while overlooking how physical deployment constraints, such as effective compute, delivered power, and cooling, limit real-world token output. This position paper reframes LLM inference as an “energy-to-token production” process and argues for an evaluation framework with joules per token as a core metric. It proposes a dimensionally consistent Token Production Function in which token rate is bounded by both compute-per-token and energy-per-token ceilings. The paper treats KV-cache compression, sparse attention, quantization, routing, and difficulty-adaptive reasoning as energy-to-token levers, and calls for reporting joules per token, the active binding constraint, delivered power adjusted for Power Usage Effectiveness (PUE), and utilization-adjusted token output alongside accuracy and latency. API price dispersion across providers serves only as directional motivation, not as causal evidence of marginal cost. The authors urge the community to shift attention from theoretical peak compute toward delivered power, cooling, and operational efficiency.

📝 Abstract

LLM inference is still evaluated mainly as a model or software problem: accuracy, latency, throughput, and hardware utilization. This is incomplete. At deployment scale, the relevant output is a quality-conditioned token produced under joint constraints from effective compute, delivered data-center power, cooling capacity, PUE, and utilization. We argue that the ML community should treat inference as energy-to-token production. We formalize this view with a dimensionally consistent Token Production Function in which token rate is bounded by both compute-per-token and energy-per-token ceilings. Listed API prices vary by over an order of magnitude across providers, but we use price dispersion only as directional motivation, not as causal evidence of marginal cost. The core physical question is instead: under fixed quality and service targets, when does the binding constraint move from theoretical peak compute toward delivered power, cooling, and operational efficiency? Under this framing, system optimizations such as latent KV-cache compression, sparse or heavily compressed attention, quantization, routing, and difficulty-adaptive reasoning are not merely local engineering tricks. They are energy-to-token levers because they reduce FLOPs/token, joules/token, memory traffic, or utilization losses under fixed quality and service targets $(q^{*}, s^{*})$. We therefore call for inference papers and benchmarks to report joules/token, the active binding constraint, PUE-adjusted delivered power, and utilization-adjusted token output alongside accuracy and latency.
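The abstract describes the Token Production Function only in words: token rate is bounded by a compute-per-token ceiling and an energy-per-token ceiling. The sketch below is one possible reading of that idea, not the paper's actual formulation. The class names, field names, the `min` form, and the choice to apply utilization as a multiplier on the binding ceiling are all assumptions made for illustration. Units are chosen so that both ceilings come out in tokens per second.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Deployment:
    # All fields are hypothetical names for this sketch.
    effective_flops_per_s: float  # sustained (not peak) compute, FLOP/s
    facility_power_w: float       # power drawn at the facility meter, W
    pue: float                    # Power Usage Effectiveness, >= 1.0
    utilization: float            # fraction of capacity serving useful tokens, 0..1


@dataclass(frozen=True)
class Workload:
    # Both costs are assumed to be measured at fixed quality and service targets.
    flops_per_token: float        # FLOP/token
    joules_per_token: float       # IT-level J/token


def token_rate(d: Deployment, w: Workload) -> Tuple[float, str]:
    """Return (tokens per second, name of the binding constraint)."""
    # PUE = facility power / IT power, so delivered IT power is facility / PUE.
    it_power_w = d.facility_power_w / d.pue
    # (FLOP/s) / (FLOP/token) = token/s
    compute_ceiling = d.effective_flops_per_s / w.flops_per_token
    # (J/s) / (J/token) = token/s
    energy_ceiling = it_power_w / w.joules_per_token
    if compute_ceiling <= energy_ceiling:
        binding, ceiling = "compute", compute_ceiling
    else:
        binding, ceiling = "energy", energy_ceiling
    # Assumption: utilization losses scale the binding ceiling linearly.
    return d.utilization * ceiling, binding


if __name__ == "__main__":
    # Illustrative numbers only; they are not measurements from the paper.
    site = Deployment(effective_flops_per_s=1e15, facility_power_w=1.5e6,
                      pue=1.5, utilization=0.5)
    job = Workload(flops_per_token=2e11, joules_per_token=400.0)
    rate, binding = token_rate(site, job)
    print(f"{rate:.0f} tokens/s, binding constraint: {binding}")
```

With these made-up inputs the compute ceiling is 5000 tokens/s and the energy ceiling is 2500 tokens/s, so the energy side binds. Lowering `joules_per_token` (for example through quantization or KV-cache compression) raises the energy ceiling until compute becomes the binding constraint again, which is the kind of shift the abstract asks benchmarks to report.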

Problem

Research questions and friction points this paper is trying to address.

LLM inference

energy-to-token

token production

physical constraints

data-center efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

energy-to-token

token production function

LLM inference