🤖 AI Summary
In cloud-native microservices, fast-evolving dependencies and cascading failure propagation degrade the accuracy and robustness of root cause analysis (RCA); existing methods struggle with concept drift, observational noise, and an over-reliance on service deviation intensity, all of which obscure true root causes. To address these challenges, we propose DynaCausal, a dynamic causality-aware RCA framework that (1) models time-varying spatio-temporal dependencies via interaction-aware representation learning over fused multi-modal dynamic signals; (2) employs a dynamic contrastive mechanism to disentangle fault indicators from contextual noise; and (3) introduces a causal-prioritized pairwise ranking objective that explicitly optimizes causal attribution. Evaluated on public benchmarks, our method achieves an average AC@1 of 0.63, an absolute gain of 0.25 to 0.46 over state-of-the-art approaches, and delivers accurate, interpretable diagnoses in highly dynamic microservice environments.
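For readers unfamiliar with the headline metric, the following is a minimal sketch of how top-k accuracy (AC@k) is commonly computed in RCA evaluations: the fraction of fault cases whose labeled root cause appears among the top k ranked services. It assumes exactly one labeled root cause per case; the function name `accuracy_at_k` and the service names are hypothetical, and the paper's own evaluation protocol (for instance, how the reported average is taken across benchmarks) may differ.

```python
from typing import Sequence


def accuracy_at_k(
    ranked_lists: Sequence[Sequence[str]],
    true_causes: Sequence[str],
    k: int = 1,
) -> float:
    """Fraction of fault cases whose true root cause is in the top-k ranking.

    Assumption: each case has a single labeled root cause service.
    """
    if len(ranked_lists) != len(true_causes):
        raise ValueError("need one labeled root cause per ranked list")
    if not ranked_lists:
        raise ValueError("need at least one fault case")
    if k < 1:
        raise ValueError("k must be a positive integer")
    hits = sum(
        1
        for ranking, truth in zip(ranked_lists, true_causes)
        if truth in ranking[:k]
    )
    return hits / len(ranked_lists)


# Four hypothetical fault cases; each ranking lists services from most to
# least suspicious, as an RCA method would output them.
rankings = [
    ["cart", "db", "frontend"],
    ["frontend", "cart", "db"],
    ["db", "cart", "frontend"],
    ["cart", "frontend", "db"],
]
truths = ["cart", "cart", "db", "db"]

ac1 = accuracy_at_k(rankings, truths, k=1)  # 2 of 4 cases hit at k=1
ac2 = accuracy_at_k(rankings, truths, k=2)  # 3 of 4 cases hit at k=2
print(ac1, ac2)
```

Under this definition, an AC@1 of 0.63 would mean the top-ranked service is the labeled root cause in 63% of fault cases.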
📝 Abstract
Cloud-native microservices enable rapid iteration and scalable deployment but also create complex, fast-evolving dependencies that challenge reliable diagnosis. Existing root cause analysis (RCA) approaches, even with multi-modal fusion of logs, traces, and metrics, remain limited in capturing dynamic behaviors and shifting service relationships. Three critical challenges persist: (i) inadequate modeling of cascading fault propagation, (ii) vulnerability to noise interference and concept drift in normal service behavior, and (iii) over-reliance on service deviation intensity that obscures true root causes. To address these challenges, we propose DynaCausal, a dynamic causality-aware framework for RCA in distributed microservice systems. DynaCausal unifies multi-modal dynamic signals to capture time-varying spatio-temporal dependencies through interaction-aware representation learning. It further introduces a dynamic contrastive mechanism to disentangle true fault indicators from contextual noise and adopts a causal-prioritized pairwise ranking objective to explicitly optimize causal attribution. Comprehensive evaluations on public benchmarks demonstrate that DynaCausal consistently surpasses state-of-the-art methods, attaining an average AC@1 of 0.63 with absolute gains from 0.25 to 0.46, and delivering both accurate and interpretable diagnoses in highly dynamic microservice environments.
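To make the idea of a pairwise ranking objective concrete, here is a minimal sketch of a generic logistic (RankNet-style) pairwise loss over per-service suspicion scores. It is not DynaCausal's causal-prioritized objective, whose exact form the abstract does not give; the function names, the score dictionary, and the service names are all hypothetical and chosen for illustration only.

```python
import math
from typing import Mapping


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_ranking_loss(scores: Mapping[str, float], root_cause: str) -> float:
    """Mean logistic loss over (root cause, other service) pairs.

    Each pair contributes log(1 + exp(s_other - s_root)), which is small
    when the root cause outscores the other service and large otherwise.
    Assumption: one labeled root cause per fault case.
    """
    if root_cause not in scores:
        raise KeyError(f"no score for root cause {root_cause!r}")
    others = [s for name, s in scores.items() if name != root_cause]
    if not others:
        raise ValueError("need at least one non-root-cause service")
    s_root = scores[root_cause]
    return sum(softplus(s_other - s_root) for s_other in others) / len(others)


# Hypothetical case: "db" is the root cause, but the downstream "frontend"
# shows the strongest deviation and therefore gets the highest raw score.
biased = {"db": 0.4, "frontend": 2.0, "cart": 0.1}
corrected = {"db": 2.0, "frontend": 0.4, "cart": 0.1}

print(pairwise_ranking_loss(biased, "db"))
print(pairwise_ranking_loss(corrected, "db"))
```

The loss depends only on score differences between the root cause and every other service, so a strongly deviating downstream service is penalized whenever it outranks the labeled root cause. This is one simple way to counteract the over-reliance on deviation intensity described in challenge (iii).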