🤖 AI Summary
To address high tail latency caused by resource contention in edge-cloud collaborative environments, this paper proposes a performance-aware load-balancing method based on round-trip time (RTT) prediction. Unlike conventional reactive strategies, the approach employs a lightweight time-series forecasting model to schedule requests proactively. We formally quantify the minimum prediction accuracy required for effective scheduling and identify the key system-level factors that influence it. Using monitoring data from a Kubernetes-managed GPU cluster, we construct a compact feature set that exploits strong correlations among metrics, achieving up to 95% prediction accuracy while keeping prediction delay below 10% of the application RTT. Simulation-based results demonstrate significant reductions in end-to-end and tail latency, improved resource utilization, and robustness across multi-tenant co-location scenarios and heterogeneous hardware.
📝 Abstract
Distributed applications increasingly demand low end-to-end latency, especially in edge and cloud environments where co-located workloads contend for limited resources. Traditional load-balancing strategies are typically reactive and rely on outdated or coarse-grained metrics, often leading to suboptimal routing decisions and increased tail latencies. This paper investigates the use of round-trip time (RTT) predictors to enhance request routing by anticipating application latency. We develop lightweight and accurate RTT predictors that are trained on time-series monitoring data collected from a Kubernetes-managed GPU cluster. By leveraging a reduced set of highly correlated monitoring metrics, our approach maintains low overhead while remaining adaptable to diverse co-location scenarios and heterogeneous hardware. The predictors achieve up to 95% accuracy while keeping the prediction delay within 10% of the application RTT. In addition, we identify the minimum prediction accuracy threshold and key system-level factors required to ensure effective predictor deployment in resource-constrained clusters. Simulation-based evaluation demonstrates that performance-aware load balancing can significantly reduce application RTT and minimize resource waste. These results highlight the feasibility of integrating predictive load balancing into future production systems.
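The core idea of the abstract, routing each request to the backend with the lowest *predicted* RTT rather than reacting to stale metrics, can be sketched as follows. Note the paper trains time-series predictors on Kubernetes monitoring metrics; this sketch substitutes a trivially lightweight exponentially weighted moving average (EWMA) as a stand-in predictor, and all class, method, and backend names are illustrative, not from the paper.

```python
# Minimal sketch of performance-aware (predictive) load balancing.
# Assumption: an EWMA over observed RTTs stands in for the paper's
# time-series forecasting model; names are hypothetical.

class EwmaRttPredictor:
    """Predicts a backend's next RTT as an EWMA of its observed RTTs."""

    def __init__(self, alpha: float = 0.3, initial_rtt_ms: float = 50.0):
        self.alpha = alpha            # weight given to the newest sample
        self.estimate = initial_rtt_ms

    def observe(self, rtt_ms: float) -> None:
        # Blend the new measurement into the running estimate.
        self.estimate = self.alpha * rtt_ms + (1 - self.alpha) * self.estimate

    def predict(self) -> float:
        return self.estimate


class PredictiveLoadBalancer:
    """Routes each request to the backend with the lowest predicted RTT."""

    def __init__(self, backends):
        self.predictors = {b: EwmaRttPredictor() for b in backends}

    def pick_backend(self) -> str:
        # Proactive decision: compare predictions, not last-seen metrics.
        return min(self.predictors, key=lambda b: self.predictors[b].predict())

    def report(self, backend: str, rtt_ms: float) -> None:
        # Feed completed-request RTTs back into that backend's predictor.
        self.predictors[backend].observe(rtt_ms)


lb = PredictiveLoadBalancer(["edge-gpu-0", "edge-gpu-1"])
lb.report("edge-gpu-0", 120.0)   # edge-gpu-0 is congested
lb.report("edge-gpu-1", 20.0)    # edge-gpu-1 is lightly loaded
print(lb.pick_backend())         # → edge-gpu-1
```

A real deployment would replace the EWMA with the trained predictor and, per the paper's accuracy threshold, fall back to a reactive policy when prediction quality drops below the minimum required for the scheduler to beat round-robin.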