Agentic Structured Graph Traversal for Root Cause Analysis of Code-related Incidents in Cloud Applications

📅 2025-12-26
🤖 AI Summary
To diagnose cloud incidents caused by code and configuration issues, this paper proposes PRAXIS, an orchestrator for an LLM-driven agentic root cause analysis (RCA) workflow. PRAXIS combines a service dependency graph (SDG) with a hammock-block program dependence graph (PDG) for each microservice, and the LLM acts as a traversal policy that moves between service-level and code-level dependencies to localize and explain failures. On a set of 30 real-world incidents, PRAXIS improves RCA accuracy by up to 3.1× over state-of-the-art ReAct baselines while reducing token consumption by 3.8×. The incident set is being compiled into an RCA benchmark.

📝 Abstract
Cloud incidents pose major operational challenges in production, with unresolved production cloud incidents costing on average over $2M per hour. Prior research identifies code- and configuration-related issues as the predominant category of root causes in cloud incidents. This paper introduces PRAXIS, an orchestrator that manages and deploys an agentic workflow for diagnosing code- and configuration-caused cloud incidents. PRAXIS employs an LLM-driven structured traversal over two types of graphs: (1) a service dependency graph (SDG) that captures microservice-level dependencies; and (2) a hammock-block program dependence graph (PDG) that captures code-level dependencies for each microservice. Together, these graphs encode microservice- and code-level dependencies, and the LLM acts as a traversal policy over them, moving between services and code dependencies to localize and explain failures. Compared to state-of-the-art ReAct baselines, PRAXIS improves RCA accuracy by up to 3.1x while reducing token consumption by 3.8x. PRAXIS is demonstrated on a set of 30 comprehensive real-world incidents that is being compiled into an RCA benchmark.
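The two-level traversal described in the abstract can be illustrated with a minimal Python sketch. This is not the PRAXIS implementation, which the abstract does not detail. All names here (`Topology`, `walk`, `localize`, `score_policy`) and the toy services, code blocks, and anomaly scores are hypothetical, and a simple score-threshold function stands in for the LLM traversal policy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Graph = Dict[str, List[str]]  # node -> nodes it depends on


@dataclass
class Topology:
    sdg: Graph                    # service -> services it depends on
    pdgs: Dict[str, Graph]        # service -> (code block -> blocks it depends on)
    entry_blocks: Dict[str, str]  # service -> block where code-level traversal starts


# A policy sees the current node and its unvisited successors and returns
# the successor to visit next, or None to stop at the current node.
# In the paper this role is played by an LLM; here it is a plain function.
Policy = Callable[[str, List[str]], Optional[str]]


def walk(graph: Graph, start: str, policy: Policy, max_steps: int = 20) -> List[str]:
    """Follow the policy's choices through one graph and return the visited path."""
    path = [start]
    visited = {start}
    current = start
    for _ in range(max_steps):
        candidates = [n for n in graph.get(current, []) if n not in visited]
        if not candidates:
            break
        choice = policy(current, candidates)
        if choice is None or choice not in candidates:
            break
        path.append(choice)
        visited.add(choice)
        current = choice
    return path


def localize(
    topo: Topology,
    alerting_service: str,
    service_policy: Policy,
    code_policy: Policy,
) -> Tuple[List[str], List[str]]:
    """Walk the SDG to a suspect service, then walk that service's PDG."""
    service_path = walk(topo.sdg, alerting_service, service_policy)
    suspect = service_path[-1]
    entry = topo.entry_blocks.get(suspect)
    if entry is None:
        return service_path, []
    code_path = walk(topo.pdgs.get(suspect, {}), entry, code_policy)
    return service_path, code_path


def score_policy(scores: Dict[str, float], threshold: float = 0.5) -> Policy:
    """Hypothetical stand-in for the LLM: pick the most anomalous candidate."""
    def choose(current: str, candidates: List[str]) -> Optional[str]:
        best = max(candidates, key=lambda n: scores.get(n, 0.0))
        return best if scores.get(best, 0.0) >= threshold else None
    return choose


# Toy topology and scores, invented for illustration only.
topo = Topology(
    sdg={"frontend": ["cart", "catalog"], "cart": ["redis"], "catalog": [], "redis": []},
    pdgs={
        "cart": {
            "handle_add": ["load_config", "validate"],
            "load_config": ["parse_ttl"],
            "validate": [],
            "parse_ttl": [],
        }
    },
    entry_blocks={"cart": "handle_add"},
)
service_scores = {"cart": 0.9, "catalog": 0.1, "redis": 0.2}
code_scores = {"load_config": 0.8, "validate": 0.1, "parse_ttl": 0.95}

service_path, code_path = localize(
    topo, "frontend", score_policy(service_scores), score_policy(code_scores)
)
print(service_path)  # ['frontend', 'cart']
print(code_path)     # ['handle_add', 'load_config', 'parse_ttl']
```

Restricting each step to the successors of the current node is what makes the traversal structured: the policy chooses among a handful of dependency edges instead of searching the whole system.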
Problem

Research questions and friction points this paper is trying to address.

Diagnoses code- and configuration-caused cloud incidents using an agentic workflow
Traverses service and code dependency graphs to localize and explain failures
Improves root cause analysis accuracy while reducing computational token usage
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-driven structured traversal over service and code dependency graphs
Orchestrates an agentic workflow for diagnosing code-related cloud incidents
Improves root cause analysis accuracy while reducing token consumption
Shengkun Cui
University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
Rahul Krishna
IBM Research, Yorktown Heights, NY 10598, USA
Saurabh Jha
Sr. Research Scientist, IBM
ML for Systems, Systems for ML, Reliability
Ravishankar K. Iyer
University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA