STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a weakness of existing LLM-driven root cause analysis (RCA) agents for microservices: an early reasoning error can derail the entire diagnosis, and the agent has no means to localize or correct it. The study reformulates RCA failures as stage-localizable reasoning bugs and structures the workflow into four stages: Evidence Package, Hypothesis Set, Analysis Structure, and Decision Report. To improve robustness, it combines stage-level auditing, budget-aware Fast/Slow Routing, counterfactual candidate evaluation for decisive stage localization, and stage-specific patch-and-replay repair. Implemented on LangGraph, the system improves root cause localization and fault type classification on a public large-scale benchmark and a real-world production dataset. It identifies the decisive faulty stage with high accuracy and repairs most initially incorrect traces within one or two replay rounds, making RCA agents more debuggable and self-repairing.
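To make the stage-level auditing and budget-aware routing more concrete, here is a minimal Python sketch. It is not the authors' implementation: the `StageAudit` record, the `route` function, the audit score in [0, 1], the `threshold`, and the `slow_cost` unit are all hypothetical. It assumes one plausible reading of Fast/Slow Routing, in which a trace takes the fast path when the audit singles out one stage (or the budget cannot cover a slow pass) and the slow path when several stages look suspicious.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Stage(Enum):
    # The four stages named in the summary.
    EP = "Evidence Package"
    HS = "Hypothesis Set"
    AS = "Analysis Structure"
    DR = "Decision Report"


@dataclass
class StageAudit:
    stage: Stage
    score: float  # hypothetical audit confidence in [0, 1]; lower is more suspicious


def route(
    audits: List[StageAudit],
    remaining_budget: int,
    slow_cost: int = 4,      # assumed cost of one slow (counterfactual) pass
    threshold: float = 0.7,  # assumed cut-off below which a stage is suspicious
) -> Tuple[str, List[Stage]]:
    """Pick a repair path and the stages it should examine."""
    suspects = [a for a in audits if a.score < threshold]
    if not suspects:
        return "accept", []
    if len(suspects) == 1 or remaining_budget < slow_cost:
        # Fast path: patch only the single most suspicious stage.
        worst = min(suspects, key=lambda a: a.score)
        return "fast", [worst.stage]
    # Slow path: hand every suspicious stage to counterfactual evaluation.
    return "slow", [a.stage for a in suspects]
```

With two suspicious stages and enough budget, `route` returns the slow path with both stages; with the same audits and a budget below `slow_cost`, it falls back to the fast path on the lowest-scoring stage.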

📝 Abstract

LLM-based root cause analysis (RCA) agents have recently emerged as a promising paradigm for incident diagnosis in microservice AIOps. However, their reliability remains fragile: an error in early evidence collection, hypothesis formulation, or causal analysis can propagate through the reasoning trace and eventually corrupt the final diagnosis. In this paper, we present STAR, a Stage-attributed Triage and Repair framework for repairing erroneous RCA traces. STAR explicitly decomposes an RCA workflow into four structured stages, namely Evidence Package (EP), Hypothesis Set (HS), Analysis Structure (AS), and Decision Report (DR), and treats agent failure as a stage-localizable reasoning bug rather than a monolithic end-to-end error. Built on top of LangGraph, STAR performs stage-wise auditing, budget-aware Fast/Slow Routing, decisive stage localization via counterfactual candidate evaluation, and stage-specific patch-and-replay repair. We evaluate STAR on a public large-scale benchmark and a real-world production dataset, using two RCA agent workflows and three foundation models. Experimental results show that STAR consistently improves both root cause localization and fault type classification over strong baselines. Moreover, STAR identifies the decisive faulty stage with high accuracy, repairs most initially incorrect traces within one or two replay rounds, and benefits substantially from both Fast/Slow Routing and counterfactual stage evaluation. These results suggest that explicitly modeling where an RCA agent fails is an effective path toward reliable, debuggable, and self-repairing agentic RCA systems.
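The idea of localizing the decisive stage by counterfactual candidate evaluation, followed by patch-and-replay, can be illustrated with a toy Python sketch. This is an assumption-laden illustration, not STAR's actual algorithm: the `replay` and `localize` helpers, the `stage_fns` mapping, the candidate `patches`, and the `score` callback are all hypothetical, and plain Python functions stand in for LangGraph nodes. The sketch swaps in a patched output at one stage, re-runs the downstream stages, and calls a stage decisive if its patch improves the final report the most.

```python
from typing import Any, Callable, Dict, Optional, Tuple

STAGES = ["EP", "HS", "AS", "DR"]  # pipeline order from the abstract


def replay(
    stage_fns: Dict[str, Callable[[Any], Any]],
    trace: Dict[str, Any],
    start: str,
    patched_output: Any,
) -> Dict[str, Any]:
    """Replace the output of `start` and re-run every downstream stage."""
    new_trace = dict(trace)
    new_trace[start] = patched_output
    idx = STAGES.index(start)
    for prev, name in zip(STAGES[idx:], STAGES[idx + 1:]):
        new_trace[name] = stage_fns[name](new_trace[prev])
    return new_trace


def localize(
    stage_fns: Dict[str, Callable[[Any], Any]],
    trace: Dict[str, Any],
    patches: Dict[str, Any],        # hypothetical: one candidate patch per stage
    score: Callable[[Any], float],  # hypothetical verifier for a Decision Report
) -> Tuple[Optional[str], Dict[str, Any]]:
    """Return the stage whose patch most improves the final report, plus the repaired trace."""
    base = score(trace["DR"])
    best_stage, best_gain, best_trace = None, 0.0, trace
    for stage, patched in patches.items():
        candidate = replay(stage_fns, trace, stage, patched)
        gain = score(candidate["DR"]) - base
        if gain > best_gain:
            best_stage, best_gain, best_trace = stage, gain, candidate
    return best_stage, best_trace
```

In a real system the `score` callback would have no access to ground truth and would itself be an auditor or verifier; the sketch leaves that component abstract. If no patch improves the score, `localize` returns `None` and the original trace unchanged.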

Problem

Research questions and friction points this paper is trying to address.

Root Cause Analysis

Microservices

LLM-based Agents

Reasoning Errors

AIOps

Innovation

Methods, ideas, or system contributions that make the work stand out.

Stage-attributed Triage

Self-repairing RCA Agents

Counterfactual Stage Localization