🤖 AI Summary
Traditional data center networks struggle to balance cost and fault tolerance, while random graph topologies have long remained impractical because they lacked scalable routing and cabling solutions. This work presents RNG, the first deployment of a random graph–based architecture in large-scale production environments. RNG introduces a new distributed routing protocol that finds a large number of edge-disjoint paths between endpoint pairs, and it uses a passive optical device that shuffles cable endpoints internally, bringing cabling complexity close to that of fat trees. Across a range of traffic patterns, RNG matches or exceeds the performance of fat-tree topologies while being up to 45% cheaper. It has become the default data center network architecture for most of Amazon's workloads.
📝 Abstract
We design and deploy at Amazon the first production datacenter fabrics based on random graphs. While the cost and fault-tolerance benefits of such topologies have long been known, their practical realization has been hampered by a lack of scalable routing and cabling approaches. Our design, called RNG, has a new distributed routing protocol that exploits the properties of random graphs to find a large number of edge-disjoint paths between endpoint pairs. A novel passive optical device that internally shuffles cable endpoints makes Amazon's cabling complexity similar to that of fat trees. We show that RNG fabrics match or exceed the performance of fat trees for a range of traffic patterns, despite being up to 45% cheaper. At Amazon, we made RNG the default datacenter fabric for most workloads.
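To make the notion of edge-disjoint paths in a random graph concrete, the following is a minimal sketch and not RNG's routing protocol, which the abstract does not specify. It assumes a random d-regular switch graph sampled with the pairing model, and it counts edge-disjoint paths between two switches with a centralized unit-capacity max-flow search (by Menger's theorem the two quantities are equal). The function names `random_regular_graph` and `edge_disjoint_paths` and the parameters `n`, `d`, and `seed` are hypothetical, chosen only for illustration.

```python
import random
from collections import deque


def random_regular_graph(n, d, seed=0):
    """Sample a simple d-regular graph on n nodes (pairing model).

    Illustrative assumption: each node is a switch with d fabric-facing
    ports. Samples with self-loops or duplicate links are rejected.
    """
    if (n * d) % 2 or d >= n:
        raise ValueError("need n * d even and d < n")
    rng = random.Random(seed)
    while True:
        # One "stub" per port; a random perfect matching of stubs gives links.
        stubs = [v for v in range(n) for _ in range(d)]
        rng.shuffle(stubs)
        edges = set()
        simple = True
        for i in range(0, len(stubs), 2):
            u, v = stubs[i], stubs[i + 1]
            edge = (min(u, v), max(u, v))
            if u == v or edge in edges:
                simple = False
                break
            edges.add(edge)
        if simple:
            adj = {v: set() for v in range(n)}
            for u, v in edges:
                adj[u].add(v)
                adj[v].add(u)
            return adj


def edge_disjoint_paths(adj, s, t):
    """Count edge-disjoint s-t paths via BFS augmenting paths (unit capacities)."""
    # Each undirected link contributes one unit of capacity in each direction.
    cap = {(u, v): 1 for u in adj for v in adj[u]}
    count = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return count
        # Push one unit of flow back along the path that was found.
        v = t
        while parent[v] is not None:
            u = parent[v]
            cap[(u, v)] -= 1
            cap[(v, u)] += 1
            v = u
        count += 1


# Usage: a 20-switch fabric with 4 fabric-facing ports per switch.
fabric = random_regular_graph(20, 4, seed=1)
print(edge_disjoint_paths(fabric, 0, 1))
```

The count can never exceed the port degree `d`, since every path must leave the source over a distinct link. A distributed protocol such as the one the abstract describes would have to discover such paths without the global view this sketch relies on.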