Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?

📅 2026-02-10

📈 Citations: 0

✨ Influential: 0

career value

203K/year

🤖 AI Summary

This study addresses the low accuracy of large language model–based AI agents in root cause analysis (RCA) for cloud systems and the lack of fine-grained evaluation methods that can pinpoint the origins of reasoning failures. The authors present the first fine-grained failure taxonomy specifically designed for cloud RCA agents, identifying and categorizing twelve systematic failure modes through the OpenRCA benchmark. These failure modes span internal reasoning, multi-agent communication, and environment interaction. The analysis reveals that hallucinated data interpretation and insufficient exploration are the most prevalent issues. While optimizing communication protocols reduces communication-related failures by up to 15 percentage points, prompt engineering alone yields limited gains. The findings indicate that the primary source of failure lies in inherent limitations of general-purpose agent architectures rather than differences in underlying model capabilities.

Technology Category

Application Category

📝 Abstract

Failures in large-scale cloud systems incur substantial financial losses, making automated Root Cause Analysis (RCA) essential for operational stability. Recent efforts leverage Large Language Model (LLM) agents to automate this task, yet existing systems exhibit low detection accuracy even with capable models, and current evaluation frameworks assess only final answer correctness without revealing why the agent's reasoning failed. This paper presents a process level failure analysis of LLM-based RCA agents. We execute the full OpenRCA benchmark across five LLM models, producing 1,675 agent runs, and classify observed failures into 12 pitfall types across intra-agent reasoning, inter-agent communication, and agent-environment interaction. Our analysis reveals that the most prevalent pitfalls, notably hallucinated data interpretation and incomplete exploration, persist across all models regardless of capability tier, indicating that these failures originate from the shared agent architecture rather than from individual model limitations. Controlled mitigation experiments further show that prompt engineering alone cannot resolve the dominant pitfalls, whereas enriching the inter-agent communication protocol reduces communication-related failures by up to 15 percentage points. The pitfall taxonomy and diagnostic methodology developed in this work provide a foundation for designing more reliable autonomous agents for cloud RCA.

Problem

Research questions and friction points this paper is trying to address.

Root Cause Analysis

LLM agents

failure analysis

cloud systems

reasoning pitfalls

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM agents

Root Cause Analysis

failure taxonomy