Agentic Memory Enhanced Recursive Reasoning for Root Cause Localization in Microservices

📅 2026-01-06

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the growing complexity of microservice systems, where traditional root cause localization methods struggle to adapt to dynamic environments, and existing large language model (LLM)-based approaches are hindered by shallow reasoning and redundant computation across alerts. Inspired by the diagnostic behaviors of Site Reliability Engineering (SRE) experts, the authors propose a recursive reasoning framework based on a multi-agent LLM architecture, enhanced with an agentic memory mechanism that reuses reasoning context across alerts. This design enables incremental reuse and multi-dimensional expansion of the reasoning process, improving both the accuracy and the efficiency of root cause localization. Experimental results show that the method outperforms current state-of-the-art approaches on both measures.

📝 Abstract

As contemporary microservice systems become increasingly popular and complex (often comprising hundreds or even thousands of fine-grained, interdependent subsystems), they are experiencing more frequent failures. Ensuring system reliability thus demands accurate root cause localization. While many traditional graph-based and deep learning approaches have been explored for this task, they often rely heavily on pre-defined schemas that struggle to adapt to evolving operational contexts. Consequently, a number of LLM-based methods have recently been proposed. However, these methods still face two major limitations: shallow, symptom-centric reasoning that undermines accuracy, and a lack of cross-alert reuse that leads to redundant reasoning and high latency. In this paper, we conduct a comprehensive study of how Site Reliability Engineers (SREs) localize the root causes of failures, drawing insights from professionals across multiple organizations. Our investigation reveals that expert root cause analysis exhibits three key characteristics: recursiveness, multi-dimensional expansion, and cross-modal reasoning. Motivated by these findings, we introduce AMER-RCL, an agentic memory enhanced recursive reasoning framework for root cause localization in microservices. AMER-RCL employs the Recursive Reasoning RCL engine, a multi-agent framework that performs recursive reasoning on each alert to progressively refine candidate causes, while Agentic Memory incrementally accumulates and reuses reasoning from prior alerts within a time window to reduce redundant exploration and lower inference latency. Experimental results demonstrate that AMER-RCL consistently outperforms state-of-the-art methods in both localization accuracy and inference efficiency.
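
To make the two ideas in the abstract concrete, here is a minimal toy sketch in Python of recursive refinement per alert combined with a time-windowed memory that reuses earlier conclusions. It is not the AMER-RCL implementation. Every name in it (`WindowedMemory`, `localize`, `is_anomalous`, the `stats` counters) is hypothetical, and the LLM agents described in the paper are replaced by a simple boolean predicate over a dependency map.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class WindowedMemory:
    """Hypothetical stand-in for Agentic Memory: keeps one conclusion
    per service and forgets it once it is older than the window."""
    window_seconds: float
    _entries: Dict[str, Tuple[float, str]] = field(default_factory=dict)

    def recall(self, service: str, now: float) -> Optional[str]:
        entry = self._entries.get(service)
        if entry is None:
            return None
        stored_at, root = entry
        if now - stored_at > self.window_seconds:
            del self._entries[service]  # stale reasoning is discarded
            return None
        return root

    def store(self, service: str, root: str, now: float) -> None:
        self._entries[service] = (now, root)


def localize(
    service: str,
    now: float,
    dependencies: Dict[str, List[str]],
    is_anomalous: Callable[[str], bool],  # assumption: replaces LLM agent reasoning
    memory: WindowedMemory,
    stats: Dict[str, int],
    visited: Optional[Set[str]] = None,
) -> str:
    """Follow the first anomalous dependency recursively and return the
    deepest anomalous service reached, reusing remembered conclusions."""
    if visited is None:
        visited = set()
    cached = memory.recall(service, now)
    if cached is not None:
        stats["reused"] += 1
        return cached
    stats["explored"] += 1
    visited.add(service)  # guards against dependency cycles
    root = service
    for dep in dependencies.get(service, []):
        if dep not in visited and is_anomalous(dep):
            root = localize(dep, now, dependencies, is_anomalous,
                            memory, stats, visited)
            break
    memory.store(service, root, now)
    return root
```

In this sketch, a second alert that arrives inside the window on a service already visited is answered from memory without re-exploring its dependencies, which is the redundancy the abstract attributes to methods without cross-alert reuse. The paper's framework additionally expands along multiple dimensions and reasons across modalities, neither of which this toy attempts.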

Problem

Research questions and friction points this paper is trying to address.

root cause localization

microservices

recursive reasoning

agentic memory

failure diagnosis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Recursive Reasoning

Agentic Memory

Root Cause Localization