🤖 AI Summary
This paper proposes a decentralized root cause localization (RCL) framework for microservice-based IoT applications at the edge, where anomalies propagate quickly across dependent services and existing centralized RCL methods add latency and communication overhead. The framework combines three components: (1) a communication- and co-location-aware microservice clustering mechanism; (2) a lightweight cross-cluster peer-to-peer approximation strategy; and (3) an anomaly scoring mechanism tailored to the heterogeneous anomaly triggers that arise across the microservice, device, and network layers. Using Personalized PageRank (PPR), it performs local graph analysis within each cluster and coordinates across clusters directly on edge devices, so localization does not depend on a central analyzer. On the MicroCERCL dataset, the approach achieves localization accuracy comparable to or higher than its centralized counterpart while reducing localization time by up to 34%, which supports its use for anomaly diagnosis in resource-constrained edge environments.
📝 Abstract
Edge computing environments host increasingly complex microservice-based IoT applications, which are prone to performance anomalies that can propagate across dependent services. Identifying the true source of such anomalies, known as Root Cause Localization (RCL), is essential for timely mitigation. However, existing RCL approaches are designed for cloud environments and rely on centralized analysis, which increases latency and communication overhead when applied at the edge. This paper proposes a decentralized RCL approach that executes localization directly at the edge device level using the Personalized PageRank (PPR) algorithm. The proposed method first groups microservices into communication- and co-location-aware clusters, thereby confining most anomaly propagation within cluster boundaries. Within each cluster, PPR is executed locally to identify the root cause, which reduces localization time. For the rare cases where anomalies propagate across clusters, we introduce an inter-cluster peer-to-peer approximation process that enables lightweight coordination among clusters with minimal communication overhead. To improve localization accuracy in heterogeneous edge environments, we also propose a novel anomaly scoring mechanism tailored to the diverse anomaly triggers that arise across the microservice, device, and network layers. Evaluation results on the publicly available edge dataset MicroCERCL demonstrate that the proposed decentralized approach achieves comparable or higher localization accuracy than its centralized counterpart while reducing localization time by up to 34%. These findings indicate that decentralized graph-based RCL can provide a practical and efficient solution for anomaly diagnosis in resource-constrained edge environments.
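To make the intra-cluster step more concrete, the following is a minimal sketch of Personalized PageRank applied to a single cluster's call graph. It is an illustration under stated assumptions, not the paper's implementation: the service names, the anomaly scores, the caller-to-callee edge direction, the damping factor `alpha=0.85`, and the handling of services without outgoing calls are all hypothetical choices made for this example. The abstract does not specify these details, and the paper's own scoring mechanism is not reproduced here.

```python
def personalized_pagerank(edges, anomaly_scores, alpha=0.85, iters=200, tol=1e-10):
    """Power-iteration PPR over a small call graph.

    edges: dict mapping a caller to the list of services it calls (assumed direction).
    anomaly_scores: dict of non-negative scores used as the personalization vector
                    (hypothetical values, not the paper's scoring mechanism).
    """
    nodes = sorted(set(edges) | set(anomaly_scores)
                   | {v for callees in edges.values() for v in callees})
    total = sum(anomaly_scores.values())
    if total <= 0:
        raise ValueError("at least one anomaly score must be positive")
    # Normalized personalization vector: restarts land on anomalous services.
    p = {n: anomaly_scores.get(n, 0.0) / total for n in nodes}
    rank = dict(p)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * p[n] for n in nodes}
        for u in nodes:
            callees = edges.get(u, [])
            if callees:
                # Walk from a caller toward the services it depends on.
                share = alpha * rank[u] / len(callees)
                for v in callees:
                    nxt[v] += share
            else:
                # No outgoing calls: return the mass via the personalization vector.
                for n in nodes:
                    nxt[n] += alpha * rank[u] * p[n]
        delta = sum(abs(nxt[n] - rank[n]) for n in nodes)
        rank = nxt
        if delta < tol:
            break
    return rank


# Hypothetical cluster: frontend calls cart and auth; cart calls db.
cluster_edges = {"frontend": ["cart", "auth"], "cart": ["db"]}
# Hypothetical per-service anomaly scores observed during an incident.
scores = {"frontend": 0.3, "cart": 0.5, "db": 0.9, "auth": 0.05}

ranks = personalized_pagerank(cluster_edges, scores)
root_cause = max(ranks, key=ranks.get)
print(root_cause)
```

In this toy cluster the walk concentrates rank on the service that is both highly anomalous and downstream of other anomalous services, which is the intuition behind using PPR for localization. Because the graph is restricted to one cluster, each edge device only iterates over its local services; cross-cluster propagation would require the inter-cluster approximation described in the abstract, which this sketch does not attempt.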