🤖 AI Summary
To address the challenge of root-cause localization in cloud-native applications, where code and configuration are tightly coupled, this paper proposes an LLM-driven dual-graph collaborative root-cause analysis (RCA) framework. It integrates a Service Dependency Graph (SDG) with a fine-grained hammock-block Program Dependence Graph (PDG) to establish a structured graph traversal paradigm that spans the service and code levels. The paper introduces a ReAct-style agent workflow augmented with graph-topology-aware reinforcement learning to jointly optimize diagnostic paths and underlying graph structure. Evaluated on 30 real-world cloud failure cases, the method achieves up to a 3.1× improvement in root-cause localization accuracy over state-of-the-art ReAct baselines, reduces token consumption by 3.8×, and enables the construction of the first open-source RCA benchmark.
📝 Abstract
Cloud incidents pose major operational challenges in production, with unresolved production cloud incidents costing, on average, over $2M per hour. Prior research identifies code- and configuration-related issues as the predominant category of root causes in cloud incidents. This paper introduces PRAXIS, an orchestrator that manages and deploys an agentic workflow for diagnosing code- and configuration-caused cloud incidents. PRAXIS employs an LLM-driven structured traversal over two types of graphs: (1) a service dependency graph (SDG) that captures microservice-level dependencies; and (2) a hammock-block program dependence graph (PDG) that captures code-level dependencies for each microservice. Together, these graphs encode microservice- and code-level dependencies, and the LLM acts as a traversal policy over them, moving between services and code dependencies to localize and explain failures. Compared to state-of-the-art ReAct baselines, PRAXIS improves RCA accuracy by up to 3.1x while reducing token consumption by 3.8x. PRAXIS is demonstrated on a set of 30 comprehensive real-world incidents that is being compiled into an RCA benchmark.
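The dual-graph traversal described in the abstract can be illustrated with a minimal sketch. This is not the PRAXIS implementation: the toy graphs, the service and block names, the `neighbors` and `traverse` helpers, and the `SUSPICIOUS` set are all hypothetical, and a deterministic `stub_policy` stands in for the LLM that would choose the next node in a real system. The sketch only shows the general idea of a policy that moves across services in an SDG and then descends into one service's PDG.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical service dependency graph: service -> services it calls.
SDG: Dict[str, List[str]] = {
    "frontend": ["checkout"],
    "checkout": ["payment"],
    "payment": [],
}

# Hypothetical per-service program dependence graphs: block -> blocks it depends on.
# Blocks are opaque names here; a real PDG would hold hammock blocks of code.
PDG: Dict[str, Dict[str, List[str]]] = {
    "frontend": {"render": []},
    "checkout": {"place_order": ["load_config"], "load_config": []},
    "payment": {"charge": ["read_timeout"], "read_timeout": []},
}

# A node is (service, block); an empty block name means "service level".
Node = Tuple[str, str]
Policy = Callable[[Node, List[Node]], Optional[Node]]


def neighbors(node: Node) -> List[Node]:
    """Return the moves available from a node in the combined graphs."""
    service, block = node
    if block == "":
        # At service level: follow a service dependency, or descend into code.
        moves = [(dep, "") for dep in SDG[service]]
        moves += [(service, b) for b in PDG[service]]
        return moves
    # At code level: follow code dependencies within the same service.
    return [(service, dep) for dep in PDG[service][block]]


def traverse(start: Node, policy: Policy, max_steps: int = 10) -> List[Node]:
    """Walk the graphs, letting the policy pick each next node."""
    path = [start]
    node = start
    for _ in range(max_steps):
        options = neighbors(node)
        if not options:
            break
        choice = policy(node, options)
        if choice is None:
            break  # the policy decided to stop here
        if choice not in options:
            raise ValueError(f"policy chose an invalid move: {choice}")
        path.append(choice)
        node = choice
    return path


# Hypothetical set of nodes with anomalous signals (assumed for illustration).
SUSPICIOUS = {
    ("checkout", ""),
    ("payment", ""),
    ("payment", "charge"),
    ("payment", "read_timeout"),
}


def stub_policy(node: Node, options: List[Node]) -> Optional[Node]:
    """Stand-in for the LLM: pick the first suspicious option, else stop."""
    for option in options:
        if option in SUSPICIOUS:
            return option
    return None
```

Calling `traverse(("frontend", ""), stub_policy)` walks from the entry service through its dependencies and ends at a code block in the `payment` service. Constraining the policy to the moves returned by `neighbors` is one plausible way to keep an LLM's exploration structured, since it can only select among graph-adjacent candidates at each step.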