🤖 AI Summary
This work addresses the growing tension between surging AI compute demand and the limited capacity of power grids by proposing AI Greenferencing—an architecture that deploys modular AI inference services directly at remote wind farms to harness their surplus renewable energy. To mitigate the instability of compute resources caused by wind intermittency, the authors design XWind, a lightweight, task-agnostic cross-site inference routing system that dynamically schedules requests based on real-time metrics including latency, KV cache utilization, and queue depth, achieving efficient load balancing without requiring future predictions. XWind integrates round-trip network latency analysis, site capacity adaptation, and multi-site coordination, supporting heterogeneous GPUs and diverse workloads. Experiments on a 64-GPU A100 testbed demonstrate that, compared to the strongest baseline, XWind reduces P99 end-to-end latency by 52% and cuts power consumption by up to 98%, while maintaining robust performance across varying loads and hardware configurations.
📝 Abstract
AI power demand is growing at an unprecedented rate while power grids are often ailing and struggle to keep up. Grid expansion comes with high capital expenditure and long-distance transmission losses, yet there is abundant renewable energy at the source, just not matched to demand.
This paper proposes a complementary AI infrastructure deployment model, AI Greenferencing, that brings modular AI compute to renewable energy sources, focusing on wind, allowing AI footprint expansion, generating local behind-the-meter demand for renewable sites, and helping ease the growing strain on power utilities. Our feasibility analysis shows that 890+ GW of wind capacity lies within 50 ms network round trip time of Azure data centers, and that site-wise right-sizing combined with spatial complementarity of wind energy keeps aggregate fleet utilization on par with traditional deployments.
To serve inference requests under variable wind power, we build XWind, a lightweight, reactive, and workload-agnostic AI inference router that uses only real-time signals: inference latency, KV-cache utilization, and queue depth, to dynamically configure sites and distribute requests. Evaluated on a real 64-GPU A100 testbed emulating three wind-powered sites with Azure production traces, XWind reduces P99 end-to-end latency by up to 52% over the strongest contender (also our idea) and by up to 98% over baselines such as power-capping and GPU idling, with consistent gains across workload types, load levels, and GPU generations.