TORAI: Unsupervised Fine-grained RCA using Multi-Source Telemetry Data

📅 2026-04-15

🤖 AI Summary
This study addresses a limitation of existing multi-source root cause analysis (RCA) methods for microservices: they rely on a complete service call graph and therefore struggle when some services emit no traces. The authors call such services "blind spots." To remove this dependency, they propose TORAI, an unsupervised, fine-grained root cause localization approach that operates without a service call graph. The method quantifies anomaly severity from the available multi-source telemetry data, groups services into severity clusters, ranks the services within each cluster through causal analysis, and then aggregates the cluster rankings and applies hypothesis testing to identify fine-grained root causes. Because no call graph has to be constructed, the approach remains applicable when trace data is missing. In experiments on three benchmark systems, TORAI outperforms state-of-the-art baselines in the presence of blind spots; on real-world failures, it places the root causes within its top-3 recommendations.

📝 Abstract
Existing multi-source root cause analysis (RCA) methods for microservice systems assume that every service emits traces, from which a service call graph can be constructed. This assumption is not practical: microservice systems evolve rapidly and may contain black-box services without traces, such as compiled software or unsupported services. We refer to these services as blind spots. In the presence of blind spots, the performance of existing multi-source RCA methods may degrade, because they diagnose only the services visible on the call graph. To overcome this limitation, we propose TORAI, a novel unsupervised approach that pinpoints fine-grained root causes without relying on the service call graph. TORAI first measures anomaly severity using the available multi-source telemetry data. It then clusters services by their severity symptoms and conducts causal analysis to rank the services within each severity cluster. Finally, TORAI aggregates the cluster rankings and uses hypothesis testing to identify fine-grained root causes. Because it requires neither a constructed service call graph nor further intrusive actions, TORAI addresses the limitations of existing methods. Our experiments on three benchmark systems demonstrate that TORAI substantially outperforms state-of-the-art baselines in the presence of blind spots. Performance on real-world failures further shows that TORAI can accurately pinpoint the root causes within its top-3 recommendations.
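
The pipeline described in the abstract (severity scoring, severity clustering, ranking, hypothesis testing) can be illustrated with a toy sketch. This is not TORAI's implementation: the abstract does not specify the severity measure, the clustering algorithm, the causal analysis, or the statistical test, so every choice below is an assumption made for illustration. Severity is a z-score-style drift from a pre-failure baseline, clustering is a simple 1-D gap split, the causal ranking step is omitted (services are ordered by severity only), and the hypothesis test is a Welch t statistic compared against a fixed, hypothetical threshold `t_crit`. The service and metric names are invented.

```python
from math import sqrt
from statistics import mean, stdev, variance


def severity(baseline, incident):
    """Assumed severity: drift of the incident mean, in baseline standard deviations."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(mean(incident) - mu) / max(sigma, 1e-9)


def cluster_by_severity(scores, gap=10.0):
    """Assumed clustering: sort by severity and start a new cluster at each large drop."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    clusters = [[ordered[0]]]
    for name, score in ordered[1:]:
        if clusters[-1][-1][1] - score > gap:
            clusters.append([])
        clusters[-1].append((name, score))
    return clusters


def welch_t(baseline, incident):
    """Welch's t statistic for a difference in means with unequal variances."""
    se = sqrt(variance(baseline) / len(baseline) + variance(incident) / len(incident))
    diff = mean(incident) - mean(baseline)
    if se == 0:
        return 0.0 if diff == 0 else float("inf")
    return diff / se


def localize(telemetry, t_crit=3.0):
    """Rank services without a call graph, then flag shifted metrics of the top service."""
    # A service is as severe as its most anomalous signal.
    scores = {svc: max(severity(b, i) for b, i in signals.values())
              for svc, signals in telemetry.items()}
    # Flatten clusters from most to least severe (causal ranking omitted here).
    ranking = [name for cluster in cluster_by_severity(scores) for name, _ in cluster]
    top = ranking[0]
    t_stats = {m: abs(welch_t(b, i)) for m, (b, i) in telemetry[top].items()}
    causes = sorted((m for m, t in t_stats.items() if t > t_crit),
                    key=t_stats.get, reverse=True)
    return ranking, causes


# Hypothetical telemetry: service -> metric -> (baseline samples, incident samples).
# "payment" stands in for a blind spot: it has metrics but no traces.
telemetry = {
    "frontend": {"latency_ms": ([100, 102, 98, 101, 99], [110, 112, 108, 111, 109])},
    "payment": {
        "cpu_pct": ([20, 22, 21, 19, 20], [90, 92, 95, 91, 93]),
        "mem_mb": ([500, 505, 495, 502, 498], [501, 499, 503, 497, 500]),
    },
    "cart": {"latency_ms": ([50, 52, 49, 51, 50], [51, 50, 52, 49, 50])},
}

ranking, causes = localize(telemetry)
print(ranking, causes)
```

On this toy data the sketch ranks `payment` first and flags only `cpu_pct`, even though no call graph or trace for `payment` is available. A real system would need the paper's actual severity measure, causal analysis, and test procedure in place of these stand-ins.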
Problem

Research questions and friction points this paper is trying to address.

root cause analysis
microservice systems
blind spots
service call graph
multi-source telemetry data
Innovation

Methods, ideas, or system contributions that make the work stand out.

unsupervised RCA
multi-source telemetry
blind spot services
causal analysis
anomaly severity clustering