REPS: Recycled Entropy Packet Spraying for Adaptive Load Balancing and Failure Mitigation

📅 2024-07-31

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large-scale datacenter networks that carry AI-training traffic suffer from load imbalance and slow failure recovery; existing Ethernet solutions such as Equal Cost Multi-Path (ECMP) and oblivious packet spraying (OPS) struggle to maintain high utilization as topologies grow. This paper proposes REPS, a lightweight, decentralized, per-packet adaptive load-balancing algorithm: it selects a path for each packet through its hash entropy, caches well-performing paths in compact per-connection state (less than 25 bytes), and recycles those paths for later packets. The design targets out-of-order transports such as Ultra Ethernet and requires neither global topology awareness nor centralized control. When a link fails, REPS reroutes traffic away from the failure in less than 100 μs. The authors evaluate REPS in large-scale simulations and on FPGA-based NICs.

📝 Abstract

Next-generation datacenters require highly efficient network load balancing to manage the growing scale of artificial intelligence (AI) training and general datacenter traffic. Existing solutions designed for Ethernet, such as Equal Cost Multi-Path (ECMP) and oblivious packet spraying (OPS), struggle to maintain high network utilizations as datacenter topologies (and network failures as a consequence) continue to grow. To address these limitations, we propose REPS, a lightweight decentralized per-packet adaptive load balancing algorithm designed to optimize network utilization while ensuring rapid recovery from link failures. REPS adapts to network conditions by caching good-performing paths. In case of a network failure, REPS re-routes traffic away from it in less than 100 microseconds. REPS is designed to be deployed with next-generation out-of-order transports, such as Ultra Ethernet, and introduces less than 25 bytes of per-connection state. We extensively evaluate REPS in large-scale simulations and FPGA-based NICs.
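The caching idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `EntropyRecycler`, the 16-bit entropy width, the cache size of 8, and the use of a congestion flag on acknowledgements as the "good path" signal are all assumptions made for illustration. The sender reuses an entropy value (the header field that switches hash to pick a path) whenever the packet that carried it was acknowledged without a congestion mark, and otherwise falls back to a random value.

```python
import random
from collections import deque


class EntropyRecycler:
    """Hypothetical sender-side sketch of caching good-performing paths."""

    def __init__(self, cache_size=8, entropy_bits=16, rng=None):
        # Bounded FIFO of entropy values believed to map to good paths.
        # Assumed sizing: 8 entries x 16 bits = 16 bytes of cached state.
        self.cache = deque(maxlen=cache_size)
        self.entropy_space = 1 << entropy_bits
        self.rng = rng if rng is not None else random.Random()

    def next_entropy(self):
        """Pick the entropy value for the next outgoing packet."""
        if self.cache:
            # Recycle a value whose earlier packet saw no congestion.
            return self.cache.popleft()
        # Nothing cached: explore a random path.
        return self.rng.randrange(self.entropy_space)

    def on_ack(self, entropy, congested):
        """Handle an ACK that echoes the entropy of the acked packet."""
        if not congested:
            self.cache.append(entropy)
        # Congested entropies are simply not cached. Packets lost on a
        # failed link never produce an ACK, so their entropy is never
        # recycled either.


lb = EntropyRecycler(cache_size=4, rng=random.Random(0))
first = lb.next_entropy()            # cache empty, so a random value
lb.on_ack(first, congested=False)    # path looked good, cache it
reused = lb.next_entropy()           # the cached value is recycled
```

In this sketch, steering away from a failed link needs no explicit failure detector: entropy values that hash onto the failed path stop coming back in acknowledgements and therefore drop out of the cache. How REPS itself detects and reacts to failures is not described in the abstract.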

Problem

Research questions and friction points this paper is trying to address.

Network Balancing

Artificial Intelligence Training

Fault Recovery

Innovation

Methods, ideas, or system contributions that make the work stand out.

REPS

Dynamic Workload Adjustment

Rapid Recovery
