Making congestion control robust to per-packet load balancing in datacenters

📅 2025-09-09
🤖 AI Summary
Per-packet load balancing in datacenters makes each flow's feedback reflect multiple paths, and existing congestion control algorithms (CCAs), which are typically designed for single-path routing, can suffer throughput collapse as a result. The paper shows that making CCAs resilient to duplicate ACKs is not sufficient, because their estimation function focuses on the latest feedback and mishandles signals that vary across paths. It proposes median feedback, which is more robust to these varying signals, and applies it to Google's Swift as MSwift. MSwift keeps Swift's incast tolerance and single-path performance while becoming robust to multi-path routing. In the evaluation, MSwift improves the 99th-percentile flow completion time (FCT) by up to 25% with both random packet spraying and adaptive routing.

📝 Abstract
Per-packet load-balancing approaches are increasingly deployed in datacenter networks. However, their combination with existing congestion control algorithms (CCAs) may lead to poor performance, and even state-of-the-art CCAs can collapse due to duplicate ACKs. A typical approach to handle this collapse is to make CCAs resilient to duplicate ACKs. In this paper, we first model the throughput collapse of a wide array of CCAs when some of the paths are congested. We show that addressing duplicate ACKs is insufficient. Instead, we explain that since CCAs are typically designed for single-path routing, their estimation function focuses on the latest feedback and mishandles feedback that reflects multiple paths. We propose to use a median feedback that is more robust to the varying signals that come with multiple paths. We introduce MSwift, which applies this principle to make Google's Swift robust to multi-path routing while keeping its incast tolerance and single-path performance. Finally, we demonstrate that MSwift improves the 99th-percentile FCT by up to 25%, both with random packet spraying and adaptive routing.
Problem

Research questions and friction points this paper is trying to address.

Addressing throughput collapse in congestion control under per-packet load balancing
Handling duplicate ACKs and the mishandling of feedback that reflects multiple paths
Improving congestion control robustness to varying multi-path network signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses median feedback for robust congestion control
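Applies this principle to Google's Swift as MSwift, keeping its incast tolerance and single-path performance
Improves the 99th-percentile FCT by up to 25% with both random packet spraying and adaptive routing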
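
To make the median-feedback idea concrete, here is a minimal Python sketch. It is not MSwift's actual algorithm. It contrasts an estimator that trusts only the latest delay sample with one that takes the median of a sliding window of recent samples. The class names, the window size, the delay target, the per-path delays, and the round-robin spraying pattern are all hypothetical choices made for illustration. Counting one rate cut per over-target sample is also a simplification of how a delay-based CCA reacts.

```python
from collections import deque
from statistics import median


class LatestFeedback:
    """Single-path style estimator: uses only the most recent delay sample."""

    def update(self, delay_us):
        return delay_us


class MedianFeedback:
    """Hypothetical estimator: median of a sliding window of recent samples."""

    def __init__(self, window=8):  # window size is an assumption
        self.samples = deque(maxlen=window)

    def update(self, delay_us):
        self.samples.append(delay_us)
        return median(self.samples)


def count_cuts(estimator, delays_us, target_us):
    """Count samples whose estimate exceeds the target (simplified rate cuts)."""
    return sum(1 for d in delays_us if estimator.update(d) > target_us)


# Assumed scenario: four paths, one congested, packets sprayed round-robin.
PATH_DELAYS_US = [20, 22, 21, 90]
TARGET_US = 50
delays = [PATH_DELAYS_US[i % len(PATH_DELAYS_US)] for i in range(40)]

latest_cuts = count_cuts(LatestFeedback(), delays, TARGET_US)
median_cuts = count_cuts(MedianFeedback(window=8), delays, TARGET_US)
print(f"latest-sample cuts: {latest_cuts}, median cuts: {median_cuts}")
```

In this toy trace, the latest-sample estimator sees an over-target delay every time a packet crosses the congested path, while the windowed median stays near the delay of the three uncongested paths. If most paths were congested, the median would rise above the target as well, so the estimator still reacts to widespread congestion.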