Towards In-Depth Root Cause Localization for Microservices with Multi-Agent Recursion-of-Thought

📅 2026-05-14
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Microservice systems fail frequently because of their dynamic interactions and evolving runtime environments. Traditional machine learning and deep learning methods for root cause localization suffer from limited interpretability and poor transferability, while existing LLM-based methods face context explosion and serial reasoning structures that hinder deep causal exploration and lower inference efficiency. This work proposes RCLAgent, a framework that realizes multi-agent recursion-of-thought with parallel reasoning. RCLAgent decomposes the diagnostic task along the trace graph, assigns a Dedicated Agent to each span, and organizes the agents recursively and in parallel according to the graph topology. The final diagnosis is synthesized from the Root-Level Diagnosis Report and the Global Evidence Graph. In experiments on multiple public benchmarks, RCLAgent outperforms state-of-the-art methods in both localization accuracy and inference efficiency.
📝 Abstract
As modern microservice systems grow increasingly complex due to dynamic interactions and evolving runtime environments, they experience failures with rising frequency. Ensuring system reliability therefore critically depends on accurate root cause localization (RCL). While numerous traditional machine learning and deep learning approaches have been explored for this task, they often suffer from limited interpretability and poor transferability across deployments. More recently, large language model (LLM)-based methods have been proposed to address these issues. However, existing LLM-based approaches still face two fundamental limitations: context explosion, which dilutes critical evidence and degrades localization accuracy, and serial reasoning structures, which hinder deep causal exploration and impair inference efficiency. In this paper, we conduct a comprehensive study of both how human SREs perform root cause localization in practice and why existing LLM-based methods fall short. Motivated by these findings, we introduce RCLAgent, an in-depth root cause localization framework for microservice systems that realizes multi-agent recursion-of-thought with parallel reasoning. RCLAgent decomposes the diagnostic process along the trace graph by assigning each span to a Dedicated Agent and organizing agents recursively and in parallel according to the graph topology, with the final diagnosis obtained by synthesizing the Root-Level Diagnosis Report and the Global Evidence Graph. Extensive experiments on multiple public benchmarks demonstrate that RCLAgent consistently outperforms state-of-the-art methods in both localization accuracy and inference efficiency.
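The recursive, per-span decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Span` and `Report` schemas, the function names `diagnose` and `localize`, and the toy scoring heuristic that stands in for an LLM call are all assumptions made for illustration. The sketch only shows the control flow: one agent per span, child agents run in parallel, and each agent sees its own span plus its children's short reports.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass, field


@dataclass
class Span:
    """One node of a trace graph (hypothetical schema, not from the paper)."""
    span_id: str
    service: str
    self_latency_ms: float  # time spent in this span, excluding its children
    error: bool = False
    children: list[Span] = field(default_factory=list)


@dataclass
class Report:
    """Short finding that a span's agent passes up to its parent."""
    span_id: str
    suspect_service: str
    score: float


async def diagnose(span: Span, evidence: dict[str, Report]) -> Report:
    """Run one agent per span: recurse into children in parallel, then decide."""
    # Fan out: child agents are awaited concurrently, following the graph topology.
    child_reports = await asyncio.gather(
        *(diagnose(child, evidence) for child in span.children)
    )

    # Placeholder for an LLM call. This toy heuristic is an assumption:
    # it scores the span from its own latency and error flag only.
    await asyncio.sleep(0)
    own = Report(
        span_id=span.span_id,
        suspect_service=span.service,
        score=span.self_latency_ms + (1000.0 if span.error else 0.0),
    )

    # Every agent's local finding goes into a shared evidence store.
    evidence[span.span_id] = own

    # The agent sees only its own span and its children's short reports,
    # so its context stays bounded no matter how large the trace is.
    return max([own, *child_reports], key=lambda r: r.score)


def localize(root: Span) -> tuple[Report, dict[str, Report]]:
    """Return the root-level report together with the collected evidence."""
    evidence: dict[str, Report] = {}
    root_report = asyncio.run(diagnose(root, evidence))
    return root_report, evidence


# Toy trace: gateway calls cart and auth; cart calls a failing db span.
trace = Span("s1", "gateway", 5.0, children=[
    Span("s2", "cart", 12.0, children=[Span("s3", "db", 40.0, error=True)]),
    Span("s4", "auth", 8.0),
])
```

Calling `localize(trace)` on this toy trace yields a root-level report whose suspect is the `db` service, plus one evidence entry per span. How RCLAgent actually synthesizes the Root-Level Diagnosis Report with the Global Evidence Graph is not specified in the abstract, so that step is left out here.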
Problem

Research questions and friction points this paper is trying to address.

root cause localization
microservices
large language models
context explosion
reasoning efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent
recursion-of-thought
root cause localization
microservices
parallel reasoning