🤖 AI Summary
Existing network mitigation systems for public clouds, where some network failures take hours to days to repair, rely on localized heuristics or coarse-grained proxy metrics and therefore fail to optimize end-to-end flow-level performance. This paper introduces SWARM, the first real-time Connection-Level Performance (CLP) optimization framework for network failure mitigation. Our approach comprises three core components: (1) a lightweight, high-fidelity, and scalable online CLP estimation model; (2) joint failure-aware path ranking and dynamic calibration of proxy metrics; and (3) online decision optimization driven by CLP feedback. Evaluated on real-world failure traces from a major cloud provider, our framework improves mitigation efficacy by over 700× in some cases compared to state-of-the-art baselines. It significantly reduces latency jitter and packet loss rate, while substantially enhancing connection stability for end users.
📝 Abstract
Some faults in data center networks require hours to days to repair because they may need reboots, re-imaging, or manual work by technicians. To reduce traffic impact, cloud providers \textit{mitigate} the effect of faults, for example, by steering traffic to alternate paths. The state of the art in automatic network mitigation uses simple safety checks and proxy metrics to choose mitigations. SWARM, the approach described in this paper, can pick orders of magnitude better mitigations by estimating end-to-end connection-level performance (CLP) metrics. At its core is a scalable CLP estimator that quickly ranks mitigations with high fidelity and, on failures observed at a large cloud provider, outperforms the state of the art by over 700$\times$ in some cases.