🤖 AI Summary
Cloud-native system failure root cause analysis faces challenges including difficulty in fusing heterogeneous observability data (e.g., logs, metrics, topology, alerts) and low diagnostic efficiency. To address these, we propose ARCA—a multimodal RAG-augmented large language model system specifically designed for cloud failure diagnosis. ARCA introduces the first cross-modal retrieval architecture that jointly embeds and aligns logs, metrics, service topology, and alerts; it further employs progressive reasoning evaluation to support natural-language–driven, interactive troubleshooting. Evaluated across diverse real-world failure scenarios, ARCA achieves a 27% average accuracy improvement and reduces mean resolution time by 41% compared to state-of-the-art baselines. Its unified multimodal grounding and interpretable reasoning chain significantly enhance both diagnostic efficiency and explainability in complex, dynamic cloud environments.
📝 Abstract
Today's cloud-hosted applications and services are complex systems, and a performance or functional instability can have dozens or hundreds of potential root causes. Our hypothesis is that by combining the pattern matching capabilities of modern AI tools with a natural multi-modal RAG LLM interface, problem identification and resolution can be simplified. ARCA is a new multi-modal RAG LLM system that targets this domain. Step-wise evaluations show that ARCA outperforms state-of-the-art alternatives.