🤖 AI Summary
To address the performance interference and low resource utilization caused by resource sharing in large-scale co-located microservice clusters, this paper proposes the first production-ready, CPI (Cycles Per Instruction)-driven, interference-aware management framework. Methodologically, it accounts for node heterogeneity and application diversity by building a machine learning model that predicts CPI from multi-dimensional metrics, and it designs C-Koordinator, an open-source collaborative scheduling platform that incorporates dynamic scheduling, fine-grained resource isolation, and real-time interference mitigation. Its key contribution is the first systematic validation, at production scale, of CPI as an effective interference metric. Experimental results show interference prediction accuracy above 90.3% and reductions of 16.7% to 36.1% in P50, P90, and P99 response times, improving service performance stability and resource utilization efficiency in co-located environments.
📝 Abstract
Microservices transform traditional monolithic applications into lightweight, loosely coupled application components and have been widely adopted in many enterprises. Cloud platform infrastructure providers enhance the resource utilization efficiency of microservices systems by co-locating different microservices. However, this approach also introduces resource competition and interference among microservices. Designing interference-aware strategies for large-scale, co-located microservice clusters is crucial for enhancing resource utilization and mitigating competition-induced interference. These challenges are further exacerbated by unreliable metrics, application diversity, and node heterogeneity.
In this paper, we first analyze the characteristics of large-scale, co-located microservice clusters at Alibaba. We then discuss why cycles per instruction (CPI) is adopted as the metric for measuring interference in large-scale production clusters, and how CPI can be predicted accurately from multi-dimensional metrics. Based on CPI interference prediction and analysis, we also present the design of the C-Koordinator platform, an open-source solution used in Alibaba clusters, which incorporates co-location and interference mitigation strategies. The interference prediction models consistently achieve over 90.3% accuracy, enabling precise prediction and rapid mitigation of interference in operational environments. As a result, application response time (RT) is reduced and stabilized at all measured percentiles (P50, P90, P99), with improvements ranging from 16.7% to 36.1% under various system loads compared with a state-of-the-art system. These results demonstrate the system's ability to maintain smooth application performance in co-located environments.
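For readers unfamiliar with the metric, CPI is the number of CPU cycles consumed divided by the number of instructions retired over the same interval; a CPI that rises for an unchanged workload suggests the workload is stalling on shared resources. The sketch below only illustrates that idea. The function names, the sample counter values, and the mean-plus-k-standard-deviations rule are assumptions made for this example; they are not the paper's prediction model, which is learned from multi-dimensional metrics.

```python
from statistics import mean, stdev


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction over one sampling interval."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def is_interfered(observed_cpi: float, baseline_cpis: list[float], k: float = 2.0) -> bool:
    """Hypothetical rule: flag a sample whose CPI exceeds the baseline
    mean by more than k sample standard deviations."""
    mu = mean(baseline_cpis)
    sigma = stdev(baseline_cpis)
    return observed_cpi > mu + k * sigma


# Made-up (cycles, instructions) counter pairs from an uncontended run.
baseline_samples = [
    (1_000_000, 1_000_000),
    (1_050_000, 1_000_000),
    (980_000, 1_000_000),
    (1_020_000, 1_000_000),
]
baseline = [cpi(c, i) for c, i in baseline_samples]

# A later sample taken while co-located with a noisy neighbor (made up).
contended = cpi(1_600_000, 1_000_000)
print(is_interfered(contended, baseline))
```

In practice the cycle and instruction counts would come from hardware performance counters, and a fixed statistical threshold like this one would not account for the node heterogeneity and application diversity that the abstract identifies as the main difficulties.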