Prediction-Guided Control in Data Center Networks

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses poor tail latency in multi-class data center networks whose workloads shift on the time scale of minutes. The authors propose Polyphony, a system that combines network-wide workload aggregation with approximate counterfactual prediction in a closed control loop. Polyphony monitors the workload, predicts how candidate network configurations would affect quality-of-service metrics, and applies the best candidate to meet operator-defined SLOs. It is designed to complement existing congestion control and traffic engineering mechanisms rather than replace them. In a CloudLab evaluation on a simple topology, Polyphony converges to tight SLOs within 10 minutes and re-stabilizes within 15 minutes after large workload shifts, whereas the prior state of the art fails to adapt.

📝 Abstract
In this paper, we design, implement, and evaluate Polyphony, a system to give network operators a new way to control and reduce the frequency of poor tail latency events in multi-class data center networks, on the time scale of minutes. Polyphony is designed to be complementary to other adaptive mechanisms like congestion control and traffic engineering, but targets different aspects of network operation that have previously been considered static. By contrast to Polyphony, prior model-free optimization methods work best when there are only a few relevant degrees of freedom and where workloads and measurements are stable, assumptions not present in modern data center networks. Polyphony develops novel methods for measuring, predicting, and controlling network quality of service metrics for a dynamically changing workload. First, we monitor and aggregate workloads on a network-wide basis; we use the result as input to an approximate counterfactual prediction engine that estimates the effect of potential network configuration changes on network quality of service; we apply the best candidate and repeat in a closed-loop manner aimed at rapidly and stably converging to a configuration that meets operator goals. Using CloudLab on a simple topology, we observe that Polyphony converges to tight SLOs within ten minutes, and re-stabilizes after large workload shifts within fifteen minutes, while the prior state of the art fails to adapt.
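
The measure, predict, apply cycle described in the abstract can be illustrated with a minimal sketch. This is not Polyphony's implementation: every name here (`aggregate`, `choose_config`, `control_loop`, `toy_predict`) is hypothetical, and the toy predictor, which models tail latency as load divided by scheduling weight, is an assumption made only so the example runs. The paper's approximate counterfactual prediction engine would take its place.

```python
from typing import Callable, Dict, List, Sequence

# All names and types below are hypothetical illustrations, not Polyphony's API.
Workload = Dict[str, float]  # aggregated offered load per traffic class
Config = Dict[str, float]    # e.g. per-class scheduling weights
Latency = Dict[str, float]   # predicted tail latency per traffic class


def aggregate(samples: Sequence[Workload]) -> Workload:
    """Average per-class load over a window of network-wide samples."""
    if not samples:
        raise ValueError("need at least one workload sample")
    classes = {c for s in samples for c in s}
    return {c: sum(s.get(c, 0.0) for s in samples) / len(samples) for c in classes}


def toy_predict(workload: Workload, cfg: Config) -> Latency:
    """Stand-in predictor (assumption): latency grows with load / weight."""
    return {c: workload[c] / cfg[c] for c in workload}


def choose_config(
    workload: Workload,
    candidates: Sequence[Config],
    predict: Callable[[Workload, Config], Latency],
    slo: Latency,
) -> Config:
    """Return the candidate whose worst predicted SLO violation is smallest."""
    def worst_violation(cfg: Config) -> float:
        predicted = predict(workload, cfg)
        return max(predicted[c] - slo[c] for c in slo)

    return min(candidates, key=worst_violation)


def control_loop(
    measure: Callable[[], Sequence[Workload]],
    candidates: Sequence[Config],
    predict: Callable[[Workload, Config], Latency],
    slo: Latency,
    apply: Callable[[Config], None],
    rounds: int,
) -> List[Config]:
    """Repeat measure -> predict -> apply; return the configs that were applied."""
    applied: List[Config] = []
    for _ in range(rounds):
        workload = aggregate(measure())
        cfg = choose_config(workload, candidates, predict, slo)
        apply(cfg)
        applied.append(cfg)
    return applied


if __name__ == "__main__":
    history = control_loop(
        measure=lambda: [{"web": 3.0, "batch": 1.0}, {"web": 5.0, "batch": 1.0}],
        candidates=[{"web": 0.5, "batch": 0.5}, {"web": 0.8, "batch": 0.2}],
        predict=toy_predict,
        slo={"web": 6.0, "batch": 6.0},
        apply=lambda cfg: None,  # a real system would push cfg to the network
        rounds=3,
    )
    print(history[-1])
```

The sketch scores each candidate by its worst per-class SLO violation, which is one plausible reading of "apply the best candidate"; the abstract does not specify the objective Polyphony actually uses.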

Problem

Research questions and friction points this paper is trying to address.

tail latency
data center networks
quality of service
network control
dynamic workloads

Innovation

Methods, ideas, or system contributions that make the work stand out.

prediction-guided control
counterfactual prediction
closed-loop network optimization
tail latency reduction
multi-class data center networks

Authors

Kevin Zhao
University of Washington

Chenning Li
PhD student at MIT CSAIL
Network Simulations, ML Systems

Anton A. Zabreyko
MIT CSAIL

Arash Nasr-Esfahany
PhD Student at MIT
Computer Networks, Computer Systems, Machine Learning, Causal Inference

Anna Goncharenko
University of Washington

David Dai
University of Washington

Sidharth Lakshmanan
University of Washington

Claire Li
University of Washington

Mohammad Alizadeh
Professor of Computer Science, MIT
Computer Networks, Systems, Machine Learning

Thomas E. Anderson
University of Washington