PowerTrip: Exploiting Federated Heterogeneous Datacenter Power for Distributed ML Training

📅 2025-07-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large-scale AI model training is constrained by regional grid capacity, necessitating cross-regional datacenter collaboration. However, a fundamental trade-off exists between heterogeneous power supply and inter-site communication latency—yet existing approaches typically assume homogeneous electricity availability, overlooking dynamic variations in both power capacity and network performance across sites. This work proposes a power-cost–inspired dynamic greedy algorithm that jointly models heterogeneous power availability (using real-world Google power traces) and network latency, enabling runtime-adaptive site selection guided by marginal gain. The method optimizes accuracy-per-unit-time efficiency while guaranteeing the target accuracy. Experiments on real power data demonstrate that this approach reduces time-to-target-accuracy by up to 50% compared to baseline strategies.

📝 Abstract
The exponential growth of large-scale AI models has led to computational and power demands that can exceed the capacity of a single data center, because the limited power supplied by a regional grid caps the computational capacity available in any one region. Consequently, distributing training workloads across geographically distributed sites has become essential. However, this approach introduces a significant challenge in the form of communication overhead, creating a fundamental trade-off between the performance gains from accessing greater aggregate power and the performance losses from increased network latency. Although prior work has focused on reducing communication volume or using heuristics for distribution, these methods assume constant, homogeneous power supplies and ignore the challenge of heterogeneous power availability between sites. To address the challenge of training large models in power-constrained, geo-distributed environments, we introduce PowerTrip, a system that dynamically selects a subset of sites during runtime to optimize the power-communication trade-off. Specifically, PowerTrip selects sites based on a power-to-cost heuristic, prioritizing those with high power availability and low network latency. PowerTrip employs a dynamic greedy approach and uses the marginal gain in training efficiency, i.e., accuracy improvement per unit of time, to optimize the number of sites, stopping where the performance penalty from network overhead negates the benefit of adding more computational power. Our evaluation, which uses real-world Google power traces to model realistic power capacity constraints, demonstrates that PowerTrip can reduce time-to-accuracy by up to 50% compared to existing baseline policies.
Problem

Research questions and friction points this paper is trying to address.

Distributing ML training across geo-distributed data centers with power constraints
Balancing power availability and network latency for optimal training efficiency
Overcoming heterogeneous power supply challenges in federated data centers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic site selection optimizes power-communication trade-off
Power-to-cost heuristic prioritizes high power, low latency sites
Marginal gain in training efficiency guides site optimization
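The three ideas above can be sketched as a single greedy loop: rank candidate sites by a power-to-cost heuristic, then keep adding sites only while the marginal gain in training efficiency stays positive. The site names, the toy efficiency model, and all constants below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of PowerTrip-style dynamic greedy site selection.
# The efficiency model and all numbers are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    power_kw: float    # power capacity currently available at the site
    latency_ms: float  # round-trip latency to the training coordinator

def efficiency(sites):
    """Toy 'accuracy gain per unit time' proxy: aggregate power buys
    throughput, while each extra site adds communication overhead."""
    if not sites:
        return 0.0
    total_power = sum(s.power_kw for s in sites)
    comm_overhead = max(s.latency_ms for s in sites) * len(sites)
    return total_power / (1.0 + 0.01 * comm_overhead)

def select_sites(candidates):
    # Power-to-cost heuristic: prefer high power and low latency.
    ranked = sorted(candidates,
                    key=lambda s: s.power_kw / s.latency_ms,
                    reverse=True)
    chosen = []
    for site in ranked:
        # Stop once the marginal efficiency gain of one more site
        # is no longer positive.
        if efficiency(chosen + [site]) <= efficiency(chosen):
            break
        chosen.append(site)
    return chosen

sites = [
    Site("us-east", power_kw=900, latency_ms=10),
    Site("eu-west", power_kw=700, latency_ms=40),
    Site("ap-south", power_kw=200, latency_ms=120),
]
print([s.name for s in select_sites(sites)])  # → ['us-east', 'eu-west']
```

In this toy run the high-latency third site is rejected because the extra communication overhead outweighs its added power, which is exactly the stopping condition the marginal-gain criterion encodes; at runtime the selection would be re-evaluated as power traces and latencies change.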