Mitigating the Performance Impact of Network Failures in Public Clouds

📅 2023-05-23

🏛️ arXiv.org

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

Existing network mitigation systems for public clouds, where network failures can persist for hours to days, rely on localized heuristics or coarse-grained proxy metrics and do not optimize end-to-end flow-level performance. This paper introduces SWARM, a real-time framework that mitigates network failures by optimizing Connection-Level Performance (CLP). The approach comprises three core components: (1) a lightweight, high-fidelity, and scalable online CLP estimation model; (2) joint failure-aware path ranking and dynamic calibration of proxy metrics; and (3) online decision optimization driven by CLP feedback. Evaluated on real-world failure traces from a major cloud provider, the framework outperforms state-of-the-art baselines by over 700× in some cases. It significantly reduces latency jitter and packet loss rate, while substantially enhancing connection stability for end users.

📝 Abstract

Some faults in data center networks require hours to days to repair because they may need reboots, re-imaging, or manual work by technicians. To reduce traffic impact, cloud providers mitigate the effect of faults, for example, by steering traffic to alternate paths. The state of the art in automatic network mitigations uses simple safety checks and proxy metrics to determine mitigations. SWARM, the approach described in this paper, can pick orders of magnitude better mitigations by estimating end-to-end connection-level performance (CLP) metrics. At its core is a scalable CLP estimator that quickly ranks mitigations with high fidelity and, on failures observed at a large cloud provider, outperforms the state of the art by over 700× in some cases.
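
The idea of ranking candidate mitigations by an estimated end-to-end metric, rather than by a local safety check, can be illustrated with a minimal sketch. This is not SWARM's estimator: the `Mitigation` fields, the `toy_clp_estimate` scoring formula, and all numbers are hypothetical assumptions chosen only to show the shape of the ranking step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Mitigation:
    """Hypothetical description of one candidate mitigation."""
    name: str
    # Assumed: fraction of flows still crossing the faulty link afterwards.
    exposed_fraction: float
    # Assumed: fraction of the original capacity still usable afterwards.
    remaining_capacity: float


def toy_clp_estimate(m: Mitigation, link_loss_rate: float, offered_load: float) -> float:
    """Toy stand-in for a CLP estimator (higher is better).

    Combines two illustrative effects: flows exposed to the lossy link
    are penalized by the loss rate, and all flows are penalized when the
    offered load exceeds the capacity left after the mitigation.
    """
    if m.remaining_capacity <= 0:
        return 0.0  # no capacity left: nothing gets through
    utilization = offered_load / m.remaining_capacity
    congestion_factor = 1.0 if utilization <= 1.0 else 1.0 / utilization
    loss_factor = 1.0 - m.exposed_fraction * link_loss_rate
    return congestion_factor * loss_factor


def rank_mitigations(
    candidates: List[Mitigation],
    estimate: Callable[[Mitigation], float],
) -> List[Mitigation]:
    """Return candidates ordered from best to worst estimated score."""
    return sorted(candidates, key=estimate, reverse=True)


if __name__ == "__main__":
    candidates = [
        Mitigation("no_action", exposed_fraction=0.25, remaining_capacity=1.0),
        Mitigation("disable_link", exposed_fraction=0.0, remaining_capacity=0.75),
        Mitigation("reweight_paths", exposed_fraction=0.05, remaining_capacity=0.95),
    ]
    ranked = rank_mitigations(
        candidates, lambda m: toy_clp_estimate(m, link_loss_rate=0.1, offered_load=0.9)
    )
    for m in ranked:
        print(m.name)
```

With these made-up inputs, disabling the faulty link ranks last because the lost capacity causes congestion that outweighs the avoided loss, which is the kind of trade-off a purely local check would not see.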

Problem

Research questions and friction points this paper is trying to address.

Optimizing end-to-end flow metrics for network failure mitigation

Ranking mitigation actions holistically for higher effectiveness

Scaling mitigation techniques to large cloud datacenters

Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimize end-to-end flow-level metrics directly

Estimate mitigation impacts quickly and accurately

Scale effectively in large datacenter environments
