🤖 AI Summary
The root cause analysis (RCA) community suffers from a critical shortage of large-scale, open-source, multimodal benchmark datasets, which hinders rigorous method evaluation and progress. To address this, we introduce LEMMA-RCA, the first cross-domain, multimodal, open-source RCA dataset tailored for IT/OT systems, encompassing realistic failure scenarios from microservices, water distribution, and water treatment systems. It comprises hundreds of system entities and fine-grained causal relationship annotations. Integrating four heterogeneous modalities (time-series metrics, logs, topology graphs, and alerts), it supports both offline/online and unimodal/multimodal RCA evaluation. Built via distributed monitoring, multi-source alignment, controllable fault injection, and causal graph annotation, the dataset ensures high fidelity and representativeness. Extensive evaluation across eight baseline methods demonstrates that multimodal joint modeling improves average F1-score by 23.6%. LEMMA-RCA is publicly released, establishing a new community benchmark for RCA research.
📝 Abstract
Root cause analysis (RCA) is crucial for enhancing the reliability and performance of complex systems. However, progress in this field has been hindered by the lack of large-scale, open-source datasets tailored for RCA. To bridge this gap, we introduce LEMMA-RCA, a large dataset designed for diverse RCA tasks across multiple domains and modalities. LEMMA-RCA features various real-world fault scenarios from information technology (IT) and operational technology (OT) systems, encompassing microservices, water distribution, and water treatment systems, with hundreds of system entities involved. We evaluate the quality of LEMMA-RCA by testing the performance of eight baseline methods on this dataset under various settings, including offline and online modes as well as single and multiple modalities. Our experimental results demonstrate the high quality of LEMMA-RCA. The dataset is publicly available at https://lemma-rca.github.io/.