🤖 AI Summary
This work addresses root cause localization for microservice-based IoT applications in edge computing environments, where performance anomalies propagate across dependent services. To make diagnosis efficient and scalable, the authors propose a cascaded graph neural network (GNN) framework. It first partitions large service dependency graphs into communities of highly interacting services via communication-driven clustering, then uses a cascaded network with two subnetworks to perform root cause localization and fault type identification hierarchically. By restricting message passing to these reduced subgraphs, the method achieves diagnostic accuracy comparable to centralized GNN baselines while keeping inference latency near-constant as graph size increases. The approach is evaluated on the MicroCERCL benchmark and on large-scale datasets generated with the iAnomaly simulation framework.
📝 Abstract
Edge computing environments host increasingly complex microservice-based IoT applications that are prone to performance anomalies propagating across dependent services. Identifying the faulty component (root cause localization, RCL) and the underlying fault type (root cause analysis, RCA) is essential for timely mitigation. Supervised graph neural networks (GNNs) currently represent the state of the art for joint RCL and RCA. However, existing approaches rely on centralized processing over full-system graphs, leading to high inference latency and limited scalability in large, distributed edge environments. In this paper, we propose a cascaded GNN framework for joint RCL and fault type identification that explicitly addresses these scalability challenges. Our approach employs communication-driven clustering to partition large service graphs into highly interacting communities and a cascaded network with two subnetworks that perform hierarchical RCL/RCA. By restricting message passing to reduced and structured subgraphs, the proposed framework significantly lowers computational complexity while preserving critical dependency information. We evaluate the proposed method on the MicroCERCL benchmark and large-scale datasets generated using the iAnomaly simulation framework. Experimental results show that the cascaded architecture achieves diagnostic accuracy comparable to centralized GNN baselines while maintaining near-constant inference latency as graph size increases, enabling scalable and actionable AIOps in edge computing environments.
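The two-stage idea in the abstract (cluster the service graph by communication, then localize hierarchically) can be illustrated with a small, GNN-free sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the clustering rule (merge services whose call volume meets a hypothetical `threshold`), the per-service `scores`, and the service names are all invented, and simple averaging stands in for the two learned subnetworks. Fault type identification is omitted.

```python
from collections import defaultdict


def cluster_by_communication(edges, threshold):
    """Group services linked by call volume >= threshold (union-find).

    Assumption: a plain threshold rule stands in for the paper's
    communication-driven clustering, whose details are not given here.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v, weight in edges:
        root_u, root_v = find(u), find(v)  # registers both endpoints
        if weight >= threshold and root_u != root_v:
            parent[root_u] = root_v

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted(sorted(members) for members in groups.values())


def cascaded_localize(clusters, anomaly_score):
    """Stage 1 picks the most anomalous cluster; stage 2 picks a service in it.

    Assumption: mean and max of precomputed scores replace the two
    learned subnetworks of the cascaded architecture.
    """
    def mean_score(cluster):
        return sum(anomaly_score[s] for s in cluster) / len(cluster)

    suspect_cluster = max(clusters, key=mean_score)
    root_cause = max(suspect_cluster, key=lambda s: anomaly_score[s])
    return suspect_cluster, root_cause


# Hypothetical service graph: (caller, callee, requests per minute).
edges = [
    ("gateway", "auth", 120),
    ("auth", "db", 95),
    ("gateway", "sensor-ingest", 8),
    ("sensor-ingest", "stream-proc", 140),
    ("stream-proc", "alerts", 110),
]
# Hypothetical per-service anomaly scores in [0, 1].
scores = {
    "gateway": 0.2, "auth": 0.1, "db": 0.1,
    "sensor-ingest": 0.3, "stream-proc": 0.9, "alerts": 0.6,
}

clusters = cluster_by_communication(edges, threshold=50)
suspect_cluster, root_cause = cascaded_localize(clusters, scores)
print(clusters)
print(suspect_cluster, root_cause)
```

The point of the sketch is the shape of the computation: the second stage only inspects services inside the cluster selected by the first stage, so its cost depends on cluster size rather than on the size of the full graph, which is the intuition behind the near-constant inference latency reported in the abstract.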