The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency

📅 2026-03-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work investigates the significant impact of context length on energy efficiency in large language model inference, revealing up to a 40× difference under identical hardware. We establish, for the first time, a quantitative inverse relationship between context window length $W$ and energy efficiency, scaling as $1/W$. To exploit this insight, we propose FleetOpt, a two-pool context routing strategy. Using the inference-fleet-sim framework augmented with calibrated power measurements and a Roofline model, our purely analytical evaluation demonstrates that FleetOpt improves energy efficiency by 2.5×—outperforming the 1.7× gain from upgrading to B200 hardware—and achieves a combined improvement of 4.25× when both are applied. At 8K context length, Qwen3-235B-A22B attains 37.8 tokens per watt, representing a 5.1× advantage over Llama-3.1-70B.
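The claimed inverse relationship between context window and efficiency can be sketched numerically. In the snippet below, the KV-cache token budget is back-derived from the paper's H100 figures (16 sequences in flight at 64K context, 256 at 4K); the per-sequence decode rate and board power are hypothetical placeholders, not the paper's calibrated power model.

```python
# Sketch of the 1/W law. KV_TOKEN_BUDGET is back-derived from the paper's
# H100 figures (16 sequences at 64K context == 256 sequences at 4K);
# per_seq_tps and power_watts are assumed values, not calibrated ones.

KV_TOKEN_BUDGET = 16 * 64 * 1024  # total KV-cache tokens one H100 can hold

def concurrency(context_window: int) -> int:
    """KV-cache concurrency limit: each sequence reserves context_window tokens."""
    return KV_TOKEN_BUDGET // context_window

def tokens_per_watt(context_window: int,
                    per_seq_tps: float = 48.0,    # assumed decode rate per sequence
                    power_watts: float = 700.0    # assumed board power, roughly flat in W
                    ) -> float:
    """Throughput scales with concurrency; power does not, hence tok/W ~ 1/W."""
    return concurrency(context_window) * per_seq_tps / power_watts

# Doubling the context window halves concurrency, and therefore halves tok/W:
assert tokens_per_watt(4 * 1024) == 2 * tokens_per_watt(8 * 1024)
```

The key structural point the sketch captures is that power sits in the denominator as a near-constant while concurrency, and hence throughput, shrinks linearly in $W$.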

📝 Abstract
How many tokens can a GPU inference cluster deliver per watt? Across deployments of identical hardware, the answer varies by 40×, not because of software inefficiency, but because of the serving context window. We derive the 1/W law: tokens per watt halves every time the context window doubles. A larger context window shrinks the KV-cache concurrency limit while leaving GPU power draw roughly unchanged. At 64K context, an H100 holds 16 sequences in flight (tok/W = 1.5); at 4K context, the same H100 holds 256 sequences (tok/W = 17.6). Routing topology, which determines the effective context window each GPU services, is therefore a more powerful energy lever than buying newer hardware. Working from published H100 power measurements, a calibrated logistic power model, and a Roofline throughput model, we derive these results analytically using the inference-fleet-sim framework; no new hardware experiments were conducted. Two-pool context-length routing (FleetOpt) delivers roughly 2.5× better tok/W than a homogeneous fleet, while upgrading from H100 to B200 delivers roughly 1.7×. The gains are independent: combining FleetOpt with B200 yields 4.25× over the homogeneous H100 baseline. B200/H200 numbers are analytical projections (±20% uncertainty); H100 results are calibrated to published measurements. For MoE models, active-parameter weight streaming adds a third lever. Qwen3-235B-A22B (22B active) reaches roughly 37.8 tok/W at 8K context on H100, a 5.1× advantage over Llama-3.1-70B, because decode time scales with activated weights, not total parameters. MoE dispatch overhead is excluded, so this figure is an upper bound.
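Why two-pool routing helps can be sketched directly from the 1/W law: a homogeneous fleet must provision every GPU for the longest context it might serve, while routing lets short requests run at a small effective window. The 80/20 traffic mix and the resulting gain below are illustrative assumptions, not the paper's calibrated 2.5× FleetOpt figure.

```python
# Hypothetical two-pool context routing sketch in the spirit of FleetOpt.
# The traffic mix and pool windows are illustrative, not the paper's numbers.

def energy_per_token(context_window: float, k: float = 1.0) -> float:
    """Under the 1/W law, tok/W = k / W, so energy per token is W / k."""
    return context_window / k

def fleet_energy(traffic, pool_window_for):
    """traffic: (request_window, token_share) pairs; each request is billed at
    the efficiency of the pool it is routed to."""
    return sum(share * energy_per_token(pool_window_for(w)) for w, share in traffic)

# Illustrative mix: 80% of tokens arrive at 4K context, 20% at 64K.
traffic = [(4_096, 0.8), (65_536, 0.2)]

# Homogeneous fleet: every GPU is provisioned for the longest context.
homog = fleet_energy(traffic, lambda w: 65_536)

# Two-pool routing: short requests go to a 4K pool, long ones to a 64K pool.
routed = fleet_energy(traffic, lambda w: 4_096 if w <= 4_096 else 65_536)

print(homog / routed)  # fleet-wide efficiency gain from routing alone
```

The gain is orthogonal to hardware generation because it comes from the denominator of the 1/W law, not from the power curve, which is why the two levers multiply in the paper's 4.25× combined result.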
Problem

Research questions and friction points this paper is trying to address.

context-length
energy efficiency
LLM inference
routing topology
KV-cache concurrency
Innovation

Methods, ideas, or system contributions that make the work stand out.

1/W Law
context-length routing
energy efficiency
LLM inference
MoE models