AI Summary
Root cause localization in cloud-native microservice systems is complicated by multi-source, heterogeneous observability data and multi-layered system entities, whose complex dependencies existing methods struggle to model comprehensively. This work reveals the asymmetric cross-layer fault propagation patterns induced by hierarchical differences among system entities and proposes a semi-supervised root cause localization framework that integrates heterogeneous graph neural networks, event abstraction, and active learning. By modeling services and hosts as heterogeneous nodes, the framework captures these propagation dynamics directly. Evaluated on two industrial benchmark datasets, the approach improves on state-of-the-art methods by up to 49.85% in Top-1 accuracy and up to 32.70% in average Top-5 accuracy.
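To make the idea of modeling services and hosts as heterogeneous nodes concrete, here is a minimal sketch of a typed graph in plain Python. It is an illustration under stated assumptions, not the framework's actual implementation: the class `HeteroGraph`, the relation names (`calls`, `runs_on`, `hosts`), and the service and host names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class HeteroGraph:
    """Toy heterogeneous graph: nodes and edges are grouped by type."""
    # node type -> list of node names
    nodes: dict = field(default_factory=lambda: defaultdict(list))
    # (source type, relation, target type) -> list of (source, target) pairs
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add_node(self, node_type, name):
        if name not in self.nodes[node_type]:
            self.nodes[node_type].append(name)

    def add_edge(self, src_type, src, relation, dst_type, dst):
        self.add_node(src_type, src)
        self.add_node(dst_type, dst)
        self.edges[(src_type, relation, dst_type)].append((src, dst))


def build_example():
    """Build a hypothetical two-service, one-host deployment."""
    g = HeteroGraph()
    # Same-layer dependency: one service calls another.
    g.add_edge("service", "frontend", "calls", "service", "cart")
    # Cross-layer dependencies, stored as separate edge types per direction
    # so that each direction can be treated differently (asymmetry).
    for svc in ("frontend", "cart"):
        g.add_edge("service", svc, "runs_on", "host", "node-1")
        g.add_edge("host", "node-1", "hosts", "service", svc)
    return g


g = build_example()
for edge_type, pairs in g.edges.items():
    print(edge_type, pairs)
```

Keeping `runs_on` and `hosts` as distinct edge types is one simple way to let a downstream model assign different weights to service-to-host and host-to-service propagation.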
Abstract
Microservice root cause localization (RCL) is fundamentally challenged by the inherent heterogeneity of cloud-native systems, which encompasses diverse observability data and multiple system entities. Existing approaches typically focus on only one aspect of heterogeneity and thus fail to capture its full diagnostic value. In this work, we systematically examine the multifaceted role of heterogeneity within both microservice systems and the RCL process. This analysis motivates a deeper investigation into how entity-level distinctions and their asymmetric dependencies influence fault behavior. Our empirical analysis of two microservice benchmarks reveals that entity-level heterogeneity naturally gives rise to heterogeneous fault propagation, which is highly asymmetric and dominated by cross-layer interactions between services and hosts. In light of this, we propose NexusRCL, a semi-supervised framework that internalizes these propagation patterns by formalizing services and hosts as distinct node types within a heterogeneous graph. This design, coupled with an event-based abstraction mechanism, allows NexusRCL to effectively capture both data-level and entity-level heterogeneity while minimizing labeling costs through active learning. Comprehensive evaluations on two industrial benchmark datasets demonstrate NexusRCL's superior performance, achieving improvements of up to 49.85% in Top-1 accuracy (A@1) and 32.70% in Average Top-5 accuracy (A@5) compared to state-of-the-art methods.
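For readers unfamiliar with the reported metrics, the sketch below shows how Top-k accuracy and average Top-5 accuracy are commonly computed in the root cause localization literature. It assumes the usual convention: Top-k accuracy is the fraction of fault cases whose true root cause appears among the first k ranked candidates, and the average Top-5 value is the mean of the Top-1 through Top-5 accuracies. The abstract does not spell out its exact definitions, and the function names and sample rankings here are hypothetical.

```python
def top_k_accuracy(ranked_lists, true_causes, k):
    """Fraction of cases whose true root cause is in the top-k candidates."""
    hits = sum(
        1 for ranked, truth in zip(ranked_lists, true_causes)
        if truth in ranked[:k]
    )
    return hits / len(true_causes)


def average_top_k(ranked_lists, true_causes, k=5):
    """Mean of the Top-1 through Top-k accuracies."""
    return sum(
        top_k_accuracy(ranked_lists, true_causes, i) for i in range(1, k + 1)
    ) / k


# Hypothetical rankings for three fault cases (most suspicious first).
rankings = [
    ["cart", "frontend", "node-1"],
    ["frontend", "cart", "node-1"],
    ["node-1", "cart", "frontend"],
]
truths = ["cart", "cart", "frontend"]

print(top_k_accuracy(rankings, truths, 1))
print(average_top_k(rankings, truths, 5))
```

In this toy data the true cause is ranked first in one of the three cases, second in another, and third in the last, so Top-1 accuracy is 1/3 and the average over Top-1 to Top-5 is 0.8.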