Diagnosing and Resolving Cloud Platform Instability with Multi-modal RAG LLMs

📅 2025-03-30
🏛️ EuroMLSys
📈 Citations: 0
Influential: 0
🤖 AI Summary
Root cause analysis of failures in cloud-native systems faces two challenges: heterogeneous observability data (e.g., logs, metrics, topology, alerts) is difficult to fuse, and diagnosis is slow. To address these, the paper proposes ARCA, a multimodal RAG-augmented large language model system designed specifically for cloud failure diagnosis. ARCA is described as introducing the first cross-modal retrieval architecture that jointly embeds and aligns logs, metrics, service topology, and alerts. It further uses progressive (step-wise) reasoning evaluation to support interactive troubleshooting driven by natural language. Evaluated across diverse real-world failure scenarios, ARCA is reported to improve average accuracy by 27% and to reduce mean resolution time by 41% compared with state-of-the-art baselines. Its unified multimodal grounding and interpretable reasoning chain are credited with improving both diagnostic efficiency and explainability in complex, dynamic cloud environments.
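
As a rough illustration of the retrieval idea described above, the sketch below ranks evidence from several observability modalities against a natural-language question and assembles the top hits into an LLM prompt. It is not ARCA's implementation: the `Evidence` type, the bag-of-words `embed` function, the sample records, and the prompt layout are all assumptions made for this example, and a real system would use learned encoders rather than word counts.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Evidence:
    modality: str  # hypothetical labels: "log", "metric", "topology", "alert"
    text: str      # textual rendering of the observation


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding (assumption; stands in for a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: list[Evidence], k: int = 2) -> list[Evidence]:
    """Return the k items most similar to the query, regardless of modality."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda e: cosine(q, embed(e.text)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, hits: list[Evidence]) -> str:
    """Tag each retrieved item with its modality so the LLM can cite its source."""
    context = "\n".join(f"[{e.modality}] {e.text}" for e in hits)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Made-up sample data, one record per modality.
corpus = [
    Evidence("log", "checkout service timeout calling payments"),
    Evidence("metric", "payments p99 latency 4200 ms"),
    Evidence("topology", "checkout depends on payments and inventory"),
    Evidence("alert", "disk usage high on node-7"),
]
query = "why is checkout timeout on payments"
hits = retrieve(query, corpus)
prompt = build_prompt(query, hits)
print(prompt)
```

Placing all modalities in one shared similarity space lets a single question pull in a log line and a topology edge together, which is the kind of cross-modal grounding the summary attributes to ARCA.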

📝 Abstract
Today's cloud-hosted applications and services are complex systems, and a performance or functional instability can have dozens or hundreds of potential root causes. Our hypothesis is that by combining the pattern matching capabilities of modern AI tools with a natural multi-modal RAG LLM interface, problem identification and resolution can be simplified. ARCA is a new multi-modal RAG LLM system that targets this domain. Step-wise evaluations show that ARCA outperforms state-of-the-art alternatives.
Problem

Research questions and friction points this paper is trying to address.

Diagnosing cloud platform instability causes efficiently
Simplifying problem identification with multi-modal RAG LLMs
Improving resolution accuracy compared to existing methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal RAG LLM for cloud instability
Combines AI pattern matching with RAG
ARCA outperforms state-of-the-art alternatives
🔎 Similar Papers
No similar papers found.
Yifan Wang
Department of Computer Science, Cornell University, Ithaca, New York, USA
Kenneth P. Birman
Cornell University
Systems support for ML, Cloud Computing, Distributed Systems, Fault Tolerance, Security