Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?

📅 2026-02-10

🤖 AI Summary
This study addresses two problems: large language model–based AI agents achieve low accuracy in root cause analysis (RCA) for cloud systems, and existing evaluation methods are too coarse to pinpoint where an agent's reasoning fails. The authors present a fine-grained failure taxonomy designed specifically for cloud RCA agents, identifying twelve systematic failure modes from runs of the OpenRCA benchmark. These failure modes span intra-agent reasoning, inter-agent communication, and agent-environment interaction. The analysis finds that hallucinated data interpretation and incomplete exploration are the most prevalent issues. Prompt engineering alone does not resolve the dominant failure modes, whereas enriching the inter-agent communication protocol reduces communication-related failures by up to 15 percentage points. The findings indicate that the primary source of failure is the shared agent architecture rather than differences in underlying model capability.

📝 Abstract
Failures in large-scale cloud systems incur substantial financial losses, making automated Root Cause Analysis (RCA) essential for operational stability. Recent efforts leverage Large Language Model (LLM) agents to automate this task, yet existing systems exhibit low detection accuracy even with capable models, and current evaluation frameworks assess only final-answer correctness without revealing why the agent's reasoning failed. This paper presents a process-level failure analysis of LLM-based RCA agents. We execute the full OpenRCA benchmark across five LLMs, producing 1,675 agent runs, and classify observed failures into 12 pitfall types across intra-agent reasoning, inter-agent communication, and agent-environment interaction. Our analysis reveals that the most prevalent pitfalls, notably hallucinated data interpretation and incomplete exploration, persist across all models regardless of capability tier, indicating that these failures originate from the shared agent architecture rather than from individual model limitations. Controlled mitigation experiments further show that prompt engineering alone cannot resolve the dominant pitfalls, whereas enriching the inter-agent communication protocol reduces communication-related failures by up to 15 percentage points. The pitfall taxonomy and diagnostic methodology developed in this work provide a foundation for designing more reliable autonomous agents for cloud RCA.
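
The abstract reports pitfall prevalence per model and a reduction measured in percentage points. As a rough illustration only, the sketch below shows one way such tallies could be computed from labelled runs. The record layout, the model names, the function names, and the toy data are all assumptions made for this example; they are not the paper's tooling or results.

```python
from collections import defaultdict

# Hypothetical labelled runs: (model, set of pitfall labels observed in that run).
# These records are invented for illustration; they are NOT data from the paper.
runs = [
    ("model_a", {"hallucinated data interpretation", "incomplete exploration"}),
    ("model_a", {"incomplete exploration"}),
    ("model_b", {"hallucinated data interpretation"}),
    ("model_b", set()),
]

def prevalence_by_model(runs):
    """Return {model: {pitfall: fraction of that model's runs showing it}}."""
    totals = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for model, pitfalls in runs:
        totals[model] += 1
        for pitfall in pitfalls:
            counts[model][pitfall] += 1
    return {
        model: {p: c / totals[model] for p, c in counts[model].items()}
        for model in totals
    }

def pitfalls_in_every_model(rates):
    """Pitfalls observed at least once in every model's runs."""
    per_model = [set(model_rates) for model_rates in rates.values()]
    return set.intersection(*per_model) if per_model else set()

def percentage_point_change(before, after):
    """Absolute difference between two rates, in percentage points."""
    return (after - before) * 100

rates = prevalence_by_model(runs)
shared = pitfalls_in_every_model(rates)
```

Note that a percentage-point change is an absolute difference between two rates: under this definition, a drop from a 40% failure rate to 25% is 15 percentage points, which is not the same as a 15% relative reduction.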
Problem

Research questions and friction points this paper is trying to address.

Root Cause Analysis
LLM agents
failure analysis
cloud systems
reasoning pitfalls
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM agents
Root Cause Analysis
failure taxonomy
agent communication
process-level evaluation