LEMMA-RCA: A Large Multi-modal Multi-domain Dataset for Root Cause Analysis

📅 2024-06-08
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
The root cause analysis (RCA) community lacks large-scale, open-source, multimodal benchmark datasets, which hinders rigorous evaluation and progress on new methods. To address this, the authors introduce LEMMA-RCA, described as the first cross-domain, multimodal, open-source RCA dataset tailored for IT/OT systems, covering realistic failure scenarios from microservices, water distribution, and water treatment systems. It comprises hundreds of system entities and fine-grained causal relationship annotations. It integrates four heterogeneous modalities (time-series metrics, logs, topology graphs, and alerts) and supports both offline/online and unimodal/multimodal RCA evaluation. Built via distributed monitoring, multi-source alignment, controllable fault injection, and causal graph annotation, the dataset aims for high fidelity and representativeness. An evaluation across eight baseline methods reports that multimodal joint modeling improves average F1-score by 23.6%. LEMMA-RCA is publicly released as a community benchmark for RCA research.

📝 Abstract
Root cause analysis (RCA) is crucial for enhancing the reliability and performance of complex systems. However, progress in this field has been hindered by the lack of large-scale, open-source datasets tailored for RCA. To bridge this gap, we introduce LEMMA-RCA, a large dataset designed for diverse RCA tasks across multiple domains and modalities. LEMMA-RCA features various real-world fault scenarios from IT and OT operation systems, encompassing microservices, water distribution, and water treatment systems, with hundreds of system entities involved. We evaluate the quality of LEMMA-RCA by testing the performance of eight baseline methods on this dataset under various settings, including offline and online modes as well as single and multiple modalities. Our experimental results demonstrate the high quality of LEMMA-RCA. The dataset is publicly available at https://lemma-rca.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale open-source datasets for root cause analysis
Need for diverse RCA tasks across multiple domains and modalities
Absence of real-world fault scenarios from IT and OT systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large multi-modal multi-domain dataset for RCA
Includes real-world fault scenarios from IT and OT systems
Evaluated with eight baseline methods in various settings