Metric Criticality Identification for Cloud Microservices

📅 2025-01-07
🤖 AI Summary
Problem: Identifying critical health indicators and configuring effective alerts in microservice systems remain challenging due to heavy reliance on manual expertise and the absence of scalable, automated approaches. Method: This paper proposes a lightweight, fully automated method for critical indicator identification that leverages only historical monitoring metrics and minimal trace data. We introduce an indicator-trace coupling modeling framework that jointly integrates information-theoretic measures (mutual information, conditional entropy) with a topology-aware graph neural network, enabling reliability-sensitive indicator ranking without logs or full multimodal telemetry. Results: Evaluated on mainstream benchmarks including DeathStarBench and OnlineBoutique, our approach significantly improves alert coverage and fault detection rates while reducing SRE effort for manual indicator filtering by over 70%. It establishes a scalable, low-overhead paradigm for intelligent alert definition in cloud-native systems.

📝 Abstract
For Site Reliability Engineers (SREs), alerts are typically the first, and often the primary, indication that a system may not be performing as expected. Once alerts are triggered, SREs delve into detailed data across various modalities, such as metrics, logs, and traces, to diagnose system issues. However, defining an optimal set of alerts is increasingly challenging due to the sheer volume of multi-modal observability data points in large cloud-native systems. Typically, alerts are manually curated, defined primarily on the metrics modality, and heavily reliant on subject matter experts manually navigating the large state space of intricate relationships in multi-modal observability data. Such a process leaves alert definitions prone to insufficient coverage, potentially missing critical events. Defining alerts is even more challenging with the shift from traditional monolithic architectures to microservice-based architectures, due to the intricate interplay between microservices governed by the application topology in a stochastic environment. To tackle this issue, we take a data-driven approach and propose KIMetrix, a system that relies only on historical metric data and lightweight microservice traces to identify microservice metric criticality. KIMetrix significantly aids subject matter experts by identifying a critical set of metrics on which to define alerts, averting the need to sift through the vast multi-modal observability space. KIMetrix examines the metric-trace coupling and leverages information-theoretic measures to recommend microservice-metric mappings in a microservice topology-aware manner. Experimental evaluation on state-of-the-art microservice-based applications demonstrates the effectiveness of our approach.
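As a rough illustration of how an information-theoretic measure could score metric criticality, the sketch below ranks metrics by their mutual information with a trace-derived signal. It is not KIMetrix's actual algorithm: the function names, the equal-width binning, the binary "slow request" signal, and the sample data are all assumptions made for this example.

```python
import math
from collections import Counter

def discretize(values, bins=4):
    """Map each value to an equal-width bin index in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant series carries no information
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

def entropy(labels):
    """Shannon entropy H(X) in bits of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(y, x):
    """H(Y | X) = H(X, Y) - H(X)."""
    return entropy(list(zip(x, y))) - entropy(x)

def mutual_information(x, y):
    """I(X; Y) = H(Y) - H(Y | X)."""
    return entropy(y) - conditional_entropy(y, x)

def rank_metrics(metrics, trace_signal, bins=4):
    """Sort metric names by mutual information with the trace signal."""
    scores = {
        name: mutual_information(discretize(series, bins), trace_signal)
        for name, series in metrics.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical per-window samples for one microservice (assumed data).
metrics = {
    "cpu_usage": [10, 12, 11, 90, 95, 92, 13, 91],
    "disk_free": [50, 50, 50, 50, 50, 50, 50, 50],
}
# Assumed trace-derived signal: 1 if requests in the window were slow.
slow_requests = [0, 0, 0, 1, 1, 1, 0, 1]

ranking = rank_metrics(metrics, slow_requests)
for name, score in ranking:
    print(f"{name}: {score:.3f} bits")
```

In this toy data, cpu_usage tracks the slow-request signal closely and so ranks above the constant disk_free series. A topology-aware variant would presumably also weigh signals from upstream and downstream services, which this sketch does not attempt.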
Problem

Research questions and friction points this paper is trying to address.

Microservices Architecture
Health Monitoring
Data Filtering
Innovation

Methods, ideas, or system contributions that make the work stand out.

KIMetrix
Microservices Monitoring
Automated Alert Optimization
Akanksha Singal
IIITD
Divya Pathak
Indian Institute of Technology Hyderabad
Software Defined Networking, In-Network Systems and Security
Kaustabha Ray
IBM Research - India, India
Felix George
IBM Research - India, India
Mudit Verma
IBM Research - India, India
Pratibha Moogi
IBM Research - India, India