GALA: Can Graph-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis?

📅 2025-08-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Root cause analysis (RCA) over multimodal telemetry (metrics, logs, traces) in microservice systems suffers from slow diagnosis, poor interpretability, and a lack of actionable remediation guidance. To address these challenges, we propose GALA, a novel multimodal RCA framework that integrates statistical causal inference with large language model (LLM) agents. The method builds a causal-graph-guided iterative reasoning workflow that goes beyond ranking suspicious services to produce causally interpretable, operationally actionable repair recommendations, using graph-augmented LLM inference and joint modeling of heterogeneous telemetry sources. Evaluated on an open-source benchmark, the approach improves accuracy over state-of-the-art methods by up to 42.22%. A human-guided LLM evaluation score indicates that its diagnostic outputs are significantly more causally sound and actionable than those of existing methods, narrowing the gap between automated diagnosis and practical incident resolution.

📝 Abstract
Root cause analysis (RCA) in microservice systems is challenging, requiring on-call engineers to rapidly diagnose failures across heterogeneous telemetry such as metrics, logs, and traces. Traditional RCA methods often focus on single modalities or merely rank suspect services, falling short of providing actionable diagnostic insights with remediation guidance. This paper introduces GALA, a novel multi-modal framework that combines statistical causal inference with LLM-driven iterative reasoning for enhanced RCA. Evaluated on an open-source benchmark, GALA improves accuracy over state-of-the-art methods by up to 42.22%. Our novel human-guided LLM evaluation score shows that GALA generates significantly more causally sound and actionable diagnostic outputs than existing methods. Through comprehensive experiments and a case study, we show that GALA bridges the gap between automated failure diagnosis and practical incident resolution by providing both accurate root cause identification and human-interpretable remediation guidance.
Problem

Research questions and friction points this paper is trying to address.

Enhancing root cause analysis in microservice systems
Combining causal inference with LLM reasoning for RCA
Providing actionable diagnostic insights and remediation guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Graph-augmented LLM for multi-modal RCA
Combines causal inference with iterative reasoning
Provides actionable diagnostics and remediation guidance
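
To make the idea of graph-augmented LLM reasoning more concrete, here is a minimal Python sketch of how a causal graph could narrow down suspect services and feed them into an LLM prompt. It is not GALA's actual algorithm, which this summary does not specify. The service names, anomaly scores, scoring rule, threshold, and the helpers `downstream`, `rank_candidates`, and `build_prompt` are all hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical causal graph: an edge A -> B means "an anomaly in A can cause one in B".
CAUSAL_EDGES = {
    "db": ["cart", "orders"],
    "cart": ["checkout"],
    "orders": ["checkout"],
    "checkout": ["frontend"],
    "frontend": [],
}

# Hypothetical per-service anomaly scores in [0, 1], e.g. fused from metrics, logs, and traces.
ANOMALY = {"db": 0.9, "cart": 0.7, "orders": 0.2, "checkout": 0.8, "frontend": 0.6}


def downstream(graph, start):
    """Return every service reachable from `start` along causal edges (breadth-first)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def rank_candidates(graph, anomaly, threshold=0.5):
    """Rank anomalous services by their own score plus the downstream anomaly they could explain."""
    scores = {}
    for svc, own in anomaly.items():
        if own < threshold:
            continue  # services that look healthy are not candidates
        explained = sum(
            anomaly[d] for d in downstream(graph, svc) if anomaly[d] >= threshold
        )
        scores[svc] = own + explained
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda s: (-scores[s], s))


def build_prompt(graph, anomaly, top_k=2):
    """Turn the top-ranked candidates and their causal edges into text for an LLM."""
    lines = ["Candidate root causes, most suspicious first:"]
    for svc in rank_candidates(graph, anomaly)[:top_k]:
        lines.append(f"- {svc} (anomaly score {anomaly[svc]:.2f})")
        for child in graph.get(svc, []):
            lines.append(f"  causal edge: {svc} -> {child}")
    lines.append("Explain the most likely root cause and suggest a remediation.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(CAUSAL_EDGES, ANOMALY))
```

In this toy setup, `db` ranks first because it is anomalous itself and sits upstream of every other anomalous service. An iterative workflow of the kind the paper describes would presumably send such a prompt to an LLM, inspect the answer, and refine the candidate set over several rounds; that loop is left out here because its details are not given in this summary.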