Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis

📅 2025-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address severe hallucination in large language models (LLMs) and the low accuracy and divergent reasoning of ReAct-based frameworks in microservice root cause analysis (RCA), this paper proposes Flow-of-Action, an SOP-flow-driven multi-agent RCA framework. Its key contributions are: (1) an SOP-flow framework, described as the first of its kind, that automatically retrieves standard operating procedures (SOPs), generates them when none match, and converts them into code; (2) a set of auxiliary agents that cooperate with the main agent to filter noise, narrow the search space, and decide when reasoning can stop; and (3) integration of tool-augmented reasoning with constraints drawn from expert diagnostic logic. Evaluated on a real-world microservice incident dataset, the method achieves an RCA accuracy of 64.01%, well above the ReAct baseline (35.50%), which the authors report meets the accuracy requirements for RCA in real-world systems.

📝 Abstract
In the realm of microservices architecture, the occurrence of frequent incidents necessitates the employment of Root Cause Analysis (RCA) for swift issue resolution. It is common that a serious incident can take several domain experts hours to identify the root cause. Consequently, a contemporary trend involves harnessing Large Language Models (LLMs) as automated agents for RCA. Though the recent ReAct framework aligns well with the Site Reliability Engineers (SREs) for its thought-action-observation paradigm, its hallucinations often lead to irrelevant actions and directly affect subsequent results. Additionally, the complex and variable clues of the incident can overwhelm the model one step further. To confront these challenges, we propose Flow-of-Action, a pioneering Standard Operation Procedure (SOP) enhanced LLM-based multi-agent system. By explicitly summarizing the diagnosis steps of SREs, SOP imposes constraints on LLMs at crucial junctures, guiding the RCA process towards the correct trajectory. To facilitate the rational and effective utilization of SOPs, we design an SOP-centric framework called SOP flow. SOP flow contains a series of tools, including one for finding relevant SOPs for incidents, another for automatically generating SOPs for incidents without relevant ones, and a tool for converting SOPs into code. This significantly alleviates the hallucination issues of ReAct in RCA tasks. We also design multiple auxiliary agents to assist the main agent by removing useless noise, narrowing the search space, and informing the main agent whether the RCA procedure can stop. Compared to the ReAct method's 35.50% accuracy, our Flow-of-Action method achieves 64.01%, meeting the accuracy requirements for RCA in real-world systems.
Problem

Research questions and friction points this paper is trying to address.

Enhance Root Cause Analysis with LLM-based systems
Reduce hallucinations in ReAct framework for RCA
Improve accuracy in automated incident diagnosis
Innovation

Methods, ideas, or system contributions that make the work stand out.

SOP enhanced LLM multi-agent system
SOP flow framework for RCA
Auxiliary agents improve accuracy
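
The SOP flow described in the abstract (find a relevant SOP, generate one if none exists, then convert the SOP into code) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`SOP`, `match_sop`, `generate_sop`, `sop_to_code`, `sop_flow`) is hypothetical, the word-overlap retrieval stands in for whatever matching the paper actually uses, and the LLM-based generation step is replaced by a fixed placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SOP:
    """A standard operating procedure: a name plus ordered diagnosis steps."""
    name: str
    steps: list[str]


def match_sop(incident: str, knowledge_base: list[SOP]) -> Optional[SOP]:
    """Hypothetical retrieval: pick the SOP whose name shares the most words
    with the incident description; return None if nothing overlaps."""
    words = set(incident.lower().split())
    best, best_score = None, 0
    for sop in knowledge_base:
        score = len(words & set(sop.name.lower().split()))
        if score > best_score:
            best, best_score = sop, score
    return best


def generate_sop(incident: str) -> SOP:
    """Placeholder for LLM-based SOP generation (assumed, not from the paper)."""
    return SOP(
        name=f"generated: {incident}",
        steps=["collect metrics", "inspect logs", "check recent changes"],
    )


def sop_to_code(sop: SOP) -> list[Callable[[], str]]:
    """Turn each SOP step into a callable; a real system would emit tool calls."""
    def make_action(step: str) -> Callable[[], str]:
        return lambda: f"executed: {step}"
    return [make_action(step) for step in sop.steps]


def sop_flow(incident: str, knowledge_base: list[SOP]) -> tuple[SOP, list[str]]:
    """Retrieve a matching SOP or generate one, then run its steps in order."""
    sop = match_sop(incident, knowledge_base) or generate_sop(incident)
    observations = [action() for action in sop_to_code(sop)]
    return sop, observations
```

In this sketch the SOP constrains which actions run and in what order, which mirrors the abstract's point that SOPs keep the agent on a fixed diagnostic trajectory instead of letting it pick arbitrary next actions.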
Changhua Pei
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China; Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences
Zexin Wang
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences
Fengrui Liu
ByteDance, Beijing, China
Zeyan Li
ByteDance
AIOps, Intelligent Operations, Software Reliability
Yang Liu
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences
Xiao He
ByteDance, Hangzhou, China
Rong Kang
ByteDance, Beijing, China
Tieying Zhang
Research Scientist at ByteDance
AI for Systems, Systems for AI
Jianjun Chen
ByteDance, San Jose, United States
Jianhui Li
Gaogang Xie
Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
Dan Pei
Associate Professor of Computer Science, Tsinghua University
AIOps, Time Series Intelligence