Towards In-Depth Root Cause Localization for Microservices with Multi-Agent Recursion-of-Thought

📅 2026-05-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Microservice systems fail often because of their dynamic interactions and evolving runtime environments. Existing root cause localization methods fall short in different ways: traditional machine learning and deep learning approaches suffer from limited interpretability and poor transferability, while LLM-based approaches suffer from context explosion and from serial reasoning that lowers inference efficiency. This work proposes RCLAgent, a framework that applies multi-agent recursion-of-thought with parallel reasoning to root cause localization. RCLAgent decomposes the diagnostic task along the trace graph, assigns a Dedicated Agent to each span, and organizes the agents recursively and in parallel according to the graph topology. The final diagnosis is synthesized from the Root-Level Diagnosis Report and the Global Evidence Graph. Experiments on multiple public benchmarks show that RCLAgent outperforms state-of-the-art methods in both localization accuracy and inference efficiency.

📝 Abstract

As modern microservice systems grow increasingly complex due to dynamic interactions and evolving runtime environments, they experience failures with rising frequency. Ensuring system reliability therefore critically depends on accurate root cause localization (RCL). While numerous traditional machine learning and deep learning approaches have been explored for this task, they often suffer from limited interpretability and poor transferability across deployments. More recently, large language model (LLM)-based methods have been proposed to address these issues. However, existing LLM-based approaches still face two fundamental limitations: context explosion, which dilutes critical evidence and degrades localization accuracy, and serial reasoning structures, which hinder deep causal exploration and impair inference efficiency. In this paper, we conduct a comprehensive study of both how human SREs perform root cause localization in practice and why existing LLM-based methods fall short. Motivated by these findings, we introduce RCLAgent, an in-depth root cause localization framework for microservice systems that realizes multi-agent recursion-of-thought with parallel reasoning. RCLAgent decomposes the diagnostic process along the trace graph by assigning each span to a Dedicated Agent and organizing agents recursively and in parallel according to the graph topology, with the final diagnosis obtained by synthesizing the Root-Level Diagnosis Report and the Global Evidence Graph. Extensive experiments on multiple public benchmarks demonstrate that RCLAgent consistently outperforms state-of-the-art methods in both localization accuracy and inference efficiency.
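The decomposition described in the abstract (one agent per span, organized recursively and in parallel along the trace graph) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Span` and `Report` schemas, the `self_time` heuristic, and the scoring rule are all assumptions chosen for illustration, and the heuristic stands in for the LLM reasoning step that a real agent would perform. The shared `evidence` list is only a loose analogue of the Global Evidence Graph.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Span:
    """One node of a trace graph (hypothetical, simplified schema)."""
    span_id: str
    service: str
    duration_ms: float
    error: bool = False
    children: list["Span"] = field(default_factory=list)


@dataclass
class Report:
    """Compact summary that a span's agent hands to its parent."""
    span_id: str
    suspect: str   # service currently blamed for the anomaly
    score: float   # heuristic anomaly score (stand-in for an LLM judgment)


def self_time(span: Span) -> float:
    """Time spent in the span itself, excluding its child calls."""
    return span.duration_ms - sum(c.duration_ms for c in span.children)


async def analyze(span: Span, evidence: list[tuple[str, str, float]]) -> Report:
    """Recursively diagnose a span; child spans are analyzed concurrently."""
    # Fan out: one coroutine per child span, mirroring the trace topology.
    child_reports = await asyncio.gather(
        *(analyze(child, evidence) for child in span.children)
    )
    # Placeholder for the agent's own reasoning (assumed heuristic).
    own = Report(
        span.span_id,
        span.service,
        self_time(span) + (1000.0 if span.error else 0.0),
    )
    # Record one parent -> child edge per call in the shared evidence list.
    for child, report in zip(span.children, child_reports):
        evidence.append((span.span_id, child.span_id, report.score))
    # Pass only the strongest candidate upward to keep the parent's context small.
    return max([own, *child_reports], key=lambda r: r.score)


def demo_trace() -> Span:
    """Hypothetical trace: frontend -> {cart -> db, auth}, with a slow db span."""
    db = Span("s3", "db", 480.0, error=True)
    cart = Span("s2", "cart", 500.0, error=True, children=[db])
    auth = Span("s4", "auth", 10.0)
    return Span("s1", "frontend", 520.0, error=True, children=[cart, auth])


def diagnose(root: Span) -> tuple[Report, list[tuple[str, str, float]]]:
    """Run the recursive analysis and return the root-level report plus evidence."""
    evidence: list[tuple[str, str, float]] = []
    report = asyncio.run(analyze(root, evidence))
    return report, evidence
```

Calling `diagnose(demo_trace())` returns the root-level report together with the collected edges. Each level forwards only a single compact report, so in this sketch the amount of context a parent sees grows with its number of children rather than with the size of the whole trace, which is one plausible reading of how the recursive structure addresses context explosion.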

Problem

Research questions and friction points this paper is trying to address.

root cause localization

microservices

large language models

context explosion

reasoning efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent

recursion-of-thought

root cause localization