SIRD: A Sender-Informed, Receiver-Driven Datacenter Transport Protocol

📅 2023-12-24
🏛️ Symposium on Networked Systems Design and Implementation
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the throughput-buffering trade-off in datacenter networks whose relative packet buffer capacity keeps shrinking, this paper proposes SIRD, a sender-informed, receiver-driven transport protocol. Receiver-driven protocols work well when the bottleneck is the ToR-to-receiver link, but the independent schedules of multiple receivers can conflict on shared links such as ToR-to-spine. SIRD is built on the insight that single-owner links should be scheduled, while shared links should be managed with reactive control. It also treats each sender's uplink as a shared link: senders return congestion feedback to receivers, which adapt their scheduling to each sender's real-time capacity. Implemented on top of the Caladan stack, SIRD delivers 100 Gbps in software while keeping network queuing minimal. Compared with receiver-driven protocols (Homa, dcPIM, and ExpressPass) and production-grade reactive protocols (Swift and DCTCP), the paper reports that SIRD is uniquely able to simultaneously maximize link utilization, minimize queuing, and obtain near-optimal latency.
📝 Abstract
Datacenter congestion control protocols are challenged to navigate the throughput-buffering trade-off while relative packet buffer capacity is trending lower year-over-year. In this context, receiver-driven protocols -- which schedule packet transmissions instead of reacting to congestion -- excel when the bottleneck lies at the ToR-to-receiver link. However, when multiple receivers must use a shared link (e.g., ToR to Spine), their independent schedules can conflict. We present SIRD, a receiver-driven congestion control protocol designed around the simple insight that single-owner links should be scheduled, while shared links should be managed with reactive control algorithms. The approach allows receivers to both precisely schedule their downlinks and to coordinate over shared bottlenecks. Critically, SIRD also treats sender uplinks as shared links, enabling the flow of congestion feedback from senders to receivers, which then adapt their scheduling to each sender's real-time capacity. This results in tight scheduling, enabling high bandwidth utilization with little contention, and thus minimal latency-inducing buffering in the fabric. We implement SIRD on top of the Caladan stack and show that SIRD's asymmetric design can deliver 100Gbps in software while keeping network queuing minimal. We further compare SIRD to state-of-the-art receiver-driven protocols (Homa, dcPIM, and ExpressPass) and production-grade reactive protocols (Swift and DCTCP) and show that SIRD is uniquely able to simultaneously maximize link utilization, minimize queuing, and obtain near-optimal latency.
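The split described in the abstract (schedule the receiver's own downlink, react to congestion on shared links and sender uplinks) can be illustrated with a minimal sketch. This is not the authors' implementation: the class names, the additive-increase/multiplicative-decrease rule, and every constant below are assumptions chosen only to show the idea. A global credit budget bounds how much data can be in flight toward the receiver's single-owner downlink, while a per-sender credit limit shrinks when that sender's data arrives marked as congested and grows otherwise.

```python
from dataclasses import dataclass, field


@dataclass
class SenderState:
    limit: float               # max credit (bytes) outstanding toward this sender
    outstanding: float = 0.0   # credit granted but not yet matched by arriving data


@dataclass
class Receiver:
    budget: float                  # global cap on outstanding credit (scheduled downlink)
    max_limit: float = 100_000.0   # hypothetical ceiling for a per-sender limit
    min_limit: float = 1_500.0     # hypothetical floor: one MTU-sized packet
    increase: float = 1_500.0      # additive increase step (assumed)
    decrease: float = 0.5          # multiplicative decrease factor (assumed)
    senders: dict = field(default_factory=dict)

    def total_outstanding(self) -> float:
        return sum(s.outstanding for s in self.senders.values())

    def grant(self, sender_id: str, want: float) -> float:
        """Grant credit, bounded by the per-sender limit and the global budget."""
        s = self.senders.setdefault(sender_id, SenderState(limit=self.max_limit))
        room = min(s.limit - s.outstanding,
                   self.budget - self.total_outstanding(),
                   want)
        granted = max(0.0, room)
        s.outstanding += granted
        return granted

    def on_data(self, sender_id: str, nbytes: float, congested: bool) -> None:
        """Data arrived: release credit, then adapt the sender's limit reactively."""
        s = self.senders[sender_id]
        s.outstanding = max(0.0, s.outstanding - nbytes)
        if congested:   # feedback from the sender uplink or a shared fabric link
            s.limit = max(self.min_limit, s.limit * self.decrease)
        else:
            s.limit = min(self.max_limit, s.limit + self.increase)
```

In this sketch, a receiver with a 10,000-byte budget and an 8,000-byte per-sender ceiling grants 8,000 bytes to the first sender and only the remaining 2,000 bytes to a second one. If the first sender's data then arrives marked as congested, its limit is halved, so no new credit is issued to it until more of the outstanding credit has been consumed.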
Problem

Research questions and friction points this paper is trying to address.

Balancing the throughput-buffering trade-off in datacenter congestion control as relative packet buffer capacity shrinks
Resolving conflicts between the independent schedules of receivers that share a link (e.g., ToR to Spine)
Maximizing link utilization while minimizing queuing by using congestion feedback from senders
Innovation

Methods, ideas, or system contributions that make the work stand out.

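Receiver-driven protocol in which receivers adapt their scheduling to each sender's real-time capacity
Schedules single-owner links and manages shared links, including sender uplinks, with reactive control
Achieves high bandwidth utilization with minimal queuing in the fabric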