REPS: Recycled Entropy Packet Spraying for Adaptive Load Balancing and Failure Mitigation

📅 2024-07-31
🤖 AI Summary
Large-scale AI-training datacenter networks suffer from load imbalance under high traffic and from slow failure recovery; existing solutions such as ECMP and OPS struggle to keep utilization high as topologies grow. This paper proposes REPS, a lightweight, decentralized, per-packet adaptive load-balancing algorithm. REPS keeps a compact cache of well-performing paths (under 25 bytes of state per connection) and recycles their entropy values for later packets, so it needs neither global topology awareness nor centralized control. It is designed for out-of-order transports such as Ultra Ethernet and re-routes traffic away from a failed link in under 100 μs. The evaluation covers large-scale simulations and FPGA-based NICs and reports improved network utilization.

📝 Abstract
Next-generation datacenters require highly efficient network load balancing to manage the growing scale of artificial intelligence (AI) training and general datacenter traffic. Existing solutions designed for Ethernet, such as Equal Cost Multi-Path (ECMP) and oblivious packet spraying (OPS), struggle to maintain high network utilization as datacenter topologies, and with them the number of network failures, continue to grow. To address these limitations, we propose REPS, a lightweight, decentralized, per-packet adaptive load balancing algorithm designed to optimize network utilization while ensuring rapid recovery from link failures. REPS adapts to network conditions by caching well-performing paths. When a network failure occurs, REPS re-routes traffic away from it in less than 100 microseconds. REPS is designed to be deployed with next-generation out-of-order transports, such as Ultra Ethernet, and introduces less than 25 bytes of per-connection state. We extensively evaluate REPS in large-scale simulations and on FPGA-based NICs.
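
The idea of caching well-performing paths can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `EntropyRecycler`, the cache size, the 16-bit entropy width, and the `congested` flag (standing in for whatever congestion signal an ACK carries) are all assumptions made for illustration. The sketch assumes that each packet carries an entropy value that switches hash to choose a path, and that ACKs echo that value back to the sender.

```python
import random
from collections import deque


class EntropyRecycler:
    """Hypothetical per-connection cache of entropy values (EVs) that worked well."""

    def __init__(self, cache_size=8, ev_bits=16, rng=None):
        # Bounded cache: when full, appending drops the oldest entry.
        self.cache = deque(maxlen=cache_size)
        self.ev_space = 1 << ev_bits
        self.rng = rng or random.Random()

    def next_entropy(self):
        """Choose the EV for the next packet: recycle a cached one, else explore."""
        if self.cache:
            return self.cache.popleft()
        return self.rng.randrange(self.ev_space)

    def on_ack(self, ev, congested):
        """Cache the EV echoed by an ACK only if its path looked healthy."""
        if not congested:
            self.cache.append(ev)
```

Under these assumptions, a packet sent over a failed link is never acknowledged, so its entropy value is never recycled and subsequent packets drift toward paths that still deliver. The state is a handful of small integers per connection, which is in the spirit of the small per-connection footprint described above.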

Problem

Research questions and friction points this paper is trying to address.

Network Balancing
Artificial Intelligence Training
Fault Recovery

Innovation

Methods, ideas, or system contributions that make the work stand out.

REPS
Dynamic Workload Adjustment
Rapid Recovery