Multi-Agent Systems for Root Cause Analysis in Microservices

📅 2026-05-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a limitation of existing large language model (LLM)-based root cause analysis (RCA) methods for microservice-based systems: they typically follow a single linear diagnostic path. The authors propose LATS-RCA, a multi-agent framework that formulates RCA as a reflection-guided tree search using the Language Agent Tree Search (LATS) algorithm. Multiple LLM-driven agents iteratively analyze each microservice's execution logs and performance metrics to collect operational evidence, and reflection scores computed from intermediate diagnostic states steer the search toward the most likely root cause. On the public Light-OAuth2 (LO2) dataset, LATS-RCA achieves high diagnostic accuracy. In a production microservice environment at a case company, accuracy is lower and computational cost is higher, which reflects the challenges of a polyglot technology stack, varied logging practices, and multi-factor root causes.

📝 Abstract

Recent advances in large language models (LLMs) have enabled early attempts to automate root cause analysis (RCA) in microservice-based systems (MSS). Yet prior works typically rely on a linear reasoning process that proceeds along a single diagnostic path. In this paper, we propose LATS-RCA, an LLM-based multi-agent framework for RCA in MSS. LATS-RCA formulates RCA as a reflection-guided, tree-structured search using a Language Agent Tree Search algorithm. In LATS-RCA, multiple LLM-driven agents iteratively perform RCA for each microservice by reasoning over its execution logs and performance metrics to collect operational evidence for root cause exploration. Reflection scores derived from intermediate diagnostic states guide the search toward the most likely root cause based on accumulated evidence. We evaluate LATS-RCA on the open-source industrial MSS Light-OAuth2 (LO2), using a publicly available dataset, and in a production microservice environment (Prod) at a case company with substantially higher operational complexity. LO2 is a small-team Java system with a homogeneous technology stack. The results on LO2 show that LATS-RCA achieves high diagnostic accuracy, and we further benchmark its associated computational costs. On Prod, LATS-RCA attains lower diagnostic accuracy and incurs higher computational cost than on LO2. The Prod deployment demonstrates the practical applicability of LATS-RCA in real-world MSS and reflects the challenges introduced by a polyglot tech stack, varied logging practices across source components, and multi-factor root causes in production-scale MSS.
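
The reflection-guided tree search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the service names, the scores in `TOY`, and the helpers `expand` and `reflect` are hypothetical stand-ins for what LLM agents would produce from logs and metrics, and the selection rule is a generic UCT-style rule of the kind used in Monte Carlo tree search. Low-scoring branches are visited less often here rather than explicitly pruned.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical toy evidence: hypothesis -> (reflection score, follow-up hypotheses).
# In a real system an LLM agent would derive both from logs and metrics.
TOY = {
    "incident": (0.0, ["auth-service", "token-service"]),
    "auth-service": (0.3, ["auth-service: slow DB query"]),
    "token-service": (0.6, ["token-service: pool exhausted", "token-service: GC pause"]),
    "auth-service: slow DB query": (0.2, []),
    "token-service: pool exhausted": (0.9, []),
    "token-service: GC pause": (0.4, []),
}

@dataclass
class Node:
    hypothesis: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0  # sum of reflection scores backed up through this node

    def uct(self, c: float) -> float:
        # Unvisited nodes are tried first; otherwise trade off mean score and exploration.
        if self.visits == 0:
            return math.inf
        explore = math.sqrt(math.log(self.parent.visits) / self.visits)
        return self.value / self.visits + c * explore

def expand(node: Node) -> List[Node]:
    """Stand-in for an agent proposing refined hypotheses."""
    return [Node(h, parent=node) for h in TOY[node.hypothesis][1]]

def reflect(node: Node) -> float:
    """Stand-in for an LLM reflection score in [0, 1]."""
    return TOY[node.hypothesis][0]

def search(root_hypothesis: str, iterations: int = 20, c: float = 1.0) -> str:
    root = Node(root_hypothesis)
    for _ in range(iterations):
        # Selection: follow the highest-UCT child down to a leaf.
        node = root
        while node.children:
            node = max(node.children, key=lambda n: n.uct(c))
        # Expansion: grow the tree below the root or below an already-scored leaf.
        if node is root or node.visits > 0:
            node.children = expand(node)
            if node.children:
                node = node.children[0]
        # Evaluation and backpropagation of the reflection score.
        score = reflect(node)
        while node is not None:
            node.visits += 1
            node.value += score
            node = node.parent
    # Report the visited non-root hypothesis with the highest mean score.
    stack, seen = [root], []
    while stack:
        n = stack.pop()
        seen.append(n)
        stack.extend(n.children)
    candidates = [n for n in seen if n.parent is not None and n.visits > 0]
    return max(candidates, key=lambda n: n.value / n.visits).hypothesis

best = search("incident")
print(best)
```

Unlike a single linear diagnostic path, the search keeps both service-level hypotheses alive and spends more iterations on the branch whose accumulated reflection scores are higher.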

Problem

Research questions and friction points this paper is trying to address.

Root Cause Analysis

Microservices

Multi-Agent Systems

Large Language Models

Operational Complexity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent Systems

Root Cause Analysis

Language Agent Tree Search