🤖 AI Summary
This work addresses federated learning across heterogeneous high-performance computing (HPC) facilities, where batch schedulers introduce stochastic queuing delays that create stragglers in synchronous methods and stale updates in asynchronous ones. The authors propose FedQueue, a queue-aware protocol that explicitly models HPC queuing delays within the federated learning pipeline. FedQueue budgets local computation using online predictions of per-facility queuing times, applies cutoff-based admission control that buffers late-arriving updates, and uses staleness-aware aggregation. The paper proves convergence for non-convex objectives at rate $\mathcal{O}(1/\sqrt{R})$ under bounded staleness and shows that staleness remains bounded with high probability under queue-prediction error. A real-world cross-facility deployment shows a 20.5% improvement over baseline algorithms, and controlled queue simulations show roughly a 34% reduction in time to reach a target accuracy under high queue variance and non-IID partitions.
📝 Abstract
Federated learning (FL) across multiple HPC facilities faces stochastic admission delays from batch schedulers, and these delays dominate wall-clock time. Synchronous FL suffers from severe stragglers, while asynchronous FL accumulates stale updates when queues spike. We propose FedQueue, a queue-aware FL protocol that incorporates scheduler delays directly into training and aggregation. FedQueue (i) predicts per-facility queue delays online to budget local work, (ii) applies cutoff-based admission that buffers late arrivals to bound staleness, and (iii) performs staleness-aware aggregation to stabilize heterogeneous local workloads. We prove convergence for non-convex objectives at rate $\mathcal{O}(1/\sqrt{R})$ under bounded staleness, and show that the admission control yields bounded staleness with high probability under queue-prediction error. A real-world cross-facility deployment of FedQueue shows a 20.5% improvement over baseline algorithms. Controlled queue simulations demonstrate robust improvement over the baselines, in particular a reduction of about 34% in the time to reach a target accuracy level under high queue variance and non-IID partitions.
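The three components named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the predictor, the budgeting rule, or the aggregation weights, so the exponential moving average, the "steps that fit before the cutoff" rule, and the `1 / (1 + staleness)` weighting below are all assumptions chosen for illustration, as are every class, function, and parameter name.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Update:
    facility: str
    delta: list[float]    # model update as a flat vector (hypothetical format)
    base_round: int       # global round the facility started training from
    arrival_time: float   # seconds since the current round opened


class QueuePredictor:
    """Online per-facility queue-delay estimate.

    Assumption: an exponential moving average stands in for whatever
    predictor the paper actually uses.
    """

    def __init__(self, alpha: float = 0.3, initial: float = 60.0):
        self.alpha = alpha
        self.initial = initial
        self.estimate: dict[str, float] = {}

    def predict(self, facility: str) -> float:
        return self.estimate.get(facility, self.initial)

    def observe(self, facility: str, delay: float) -> None:
        prev = self.predict(facility)
        self.estimate[facility] = (1 - self.alpha) * prev + self.alpha * delay


def budget_local_steps(predicted_delay: float, cutoff: float,
                       sec_per_step: float, max_steps: int) -> int:
    """(i) Local steps that fit between the predicted queue wait and the cutoff."""
    remaining = cutoff - predicted_delay
    steps = int(remaining // sec_per_step)
    return max(1, min(max_steps, steps))  # always do at least one step


def admit(updates: list[Update], cutoff: float):
    """(ii) Updates arriving by the cutoff are admitted; the rest are buffered."""
    on_time = [u for u in updates if u.arrival_time <= cutoff]
    late = [u for u in updates if u.arrival_time > cutoff]
    return on_time, late


def aggregate(model: list[float], updates: list[Update],
              current_round: int) -> list[float]:
    """(iii) Weighted average of updates, down-weighting stale ones.

    Assumption: weight = 1 / (1 + staleness), normalized to sum to 1.
    """
    if not updates:
        return list(model)
    weights = [1.0 / (1 + current_round - u.base_round) for u in updates]
    total = sum(weights)
    new_model = list(model)
    for w, u in zip(weights, updates):
        for i, d in enumerate(u.delta):
            new_model[i] += (w / total) * d
    return new_model
```

In a round loop under these assumptions, the server would call `budget_local_steps` with each facility's `predict` value before dispatch, call `admit` at the cutoff, pass the admitted updates together with any updates buffered from the previous round to `aggregate`, and feed the observed queue delays back through `observe`.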