C-Koordinator: Interference-aware Management for Large-scale and Co-located Microservice Clusters

📅 2025-07-23
🤖 AI Summary
To address the performance interference and low resource utilization caused by resource sharing in large-scale co-located microservice clusters, this paper proposes the first production-ready interference-aware management framework driven by CPI (cycles per instruction). Methodologically, it accounts for node heterogeneity and application diversity by building a machine learning model that predicts CPI from multi-dimensional metrics, and it designs C-Koordinator, an open-source co-location scheduling platform that incorporates dynamic scheduling, fine-grained resource isolation, and real-time interference mitigation. Its key contribution is the first systematic validation, at production scale, of CPI as an effective interference metric. Experimental results show interference prediction accuracy above 90.3% and reductions of 16.7% to 36.1% in P50, P90, and P99 response times, improving both the performance stability of services and resource utilization in co-located environments.

📝 Abstract
Microservices transform traditional monolithic applications into lightweight, loosely coupled application components and have been widely adopted in many enterprises. Cloud platform infrastructure providers enhance the resource utilization efficiency of microservice systems by co-locating different microservices. However, this approach also introduces resource competition and interference among microservices. Designing interference-aware strategies for large-scale, co-located microservice clusters is crucial for enhancing resource utilization and mitigating competition-induced interference. These challenges are further exacerbated by unreliable metrics, application diversity, and node heterogeneity. In this paper, we first analyze the characteristics of large-scale, co-located microservice clusters at Alibaba. We then discuss why cycles per instruction (CPI) is adopted as the metric for interference measurement in large-scale production clusters, and how CPI can be predicted accurately from multi-dimensional metrics. Based on CPI interference prediction and analysis, we also present the design of the C-Koordinator platform, an open-source solution used in Alibaba's clusters, which incorporates co-location and interference mitigation strategies. The interference prediction models consistently achieve over 90.3% accuracy, enabling precise prediction and rapid mitigation of interference in operational environments. As a result, application response time (RT) is reduced and stabilized at the P50, P90, and P99 percentiles, with improvements ranging from 16.7% to 36.1% under various system loads compared with a state-of-the-art system. These results demonstrate the system's ability to maintain smooth application performance in co-located environments.
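To make the CPI metric concrete, here is a minimal sketch of how it can be computed from two hardware counter deltas and compared against a baseline. CPI itself is simply cycles divided by retired instructions. The `CounterSample` structure, the `is_interfered` helper, and the 20% tolerance are illustrative assumptions, not the detection rule or the prediction model used by C-Koordinator.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Counter deltas over one sampling window (hypothetical structure)."""
    cycles: int
    instructions: int


def cpi(sample: CounterSample) -> float:
    """Cycles per instruction for one sampling window."""
    if sample.instructions <= 0:
        raise ValueError("instructions must be positive")
    return sample.cycles / sample.instructions


def is_interfered(sample: CounterSample, baseline_cpi: float,
                  tolerance: float = 0.2) -> bool:
    """Flag a window whose CPI exceeds the baseline by more than `tolerance`.

    The threshold rule and its default value are assumptions made for
    illustration only.
    """
    return cpi(sample) > baseline_cpi * (1.0 + tolerance)


if __name__ == "__main__":
    window = CounterSample(cycles=3000, instructions=2000)
    print(cpi(window))                              # 1.5
    print(is_interfered(window, baseline_cpi=1.0))  # True
```

A higher CPI for the same code path means the processor spent more cycles per unit of work, which is why a rise above an application's usual CPI can indicate contention from co-located workloads.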
Problem

Research questions and friction points this paper is trying to address.

Manage resource competition in co-located microservice clusters
Predict interference using the cycles per instruction (CPI) metric
Reduce application latency through interference-aware strategies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses CPI to measure interference in large-scale production clusters
Accurately predicts CPI from multi-dimensional metrics
Implements co-location and interference mitigation strategies
Authors
Shengye Song
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Minxian Xu
Associate Professor, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Cloud Computing, Microservices, LLM Inference
Zuowei Zhang
Alibaba Group, Beijing, China
Chengxi Gao
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Fansong Zeng
Alibaba Group, Beijing, China
Yu Ding
Alibaba Group, Beijing, China
Kejiang Ye
Professor, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences
Cloud Computing, AI Systems, Industrial Internet
Chengzhong Xu
State Key Lab of IOTSC, University of Macau, Macau, China