Diagnosing and Resolving Cloud Platform Instability with Multi-modal RAG LLMs

📅 2025-03-30

🏛️ EuroMLSys

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Root cause analysis for cloud-native system failures is hard for two reasons: heterogeneous observability data (logs, metrics, topology, alerts) is difficult to fuse, and diagnosis is slow. To address both, we propose ARCA, a multi-modal RAG-augmented large language model system designed specifically for cloud failure diagnosis. ARCA introduces a cross-modal retrieval architecture, presented as the first of its kind, that jointly embeds and aligns logs, metrics, service topology, and alerts; it further employs progressive reasoning evaluation to support interactive troubleshooting driven by natural language. Evaluated across diverse real-world failure scenarios, ARCA achieves a 27% average accuracy improvement and reduces mean resolution time by 41% compared to state-of-the-art baselines. Its unified multi-modal grounding and interpretable reasoning chain improve both diagnostic efficiency and explainability in complex, dynamic cloud environments.
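The retrieval idea described above can be illustrated with a minimal sketch. This page does not describe ARCA's actual embedding model, index, or prompt format, so everything here is an assumption made for illustration: the sample corpus, the `embed`, `retrieve`, and `build_prompt` helpers, and the use of bag-of-words cosine similarity as a stand-in for a learned joint embedding.

```python
import math
from collections import Counter

# Hypothetical observability snippets, each tagged with its modality.
# These are invented for illustration and are not taken from the paper.
CORPUS = [
    ("log", "checkout-service ERROR connection pool exhausted timeout"),
    ("metric", "checkout-service p99 latency spike cpu normal"),
    ("topology", "checkout-service depends on payments-db and cart-cache"),
    ("alert", "payments-db connection count above threshold"),
    ("log", "auth-service INFO token refreshed"),
]


def embed(text):
    """Toy shared embedding: lowercase bag-of-words term counts.

    Stand-in assumption for a learned cross-modal embedding model.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, k=3):
    """Return up to k (modality, text) pairs ranked by similarity to the query.

    All modalities live in one index, so a single query can surface
    logs, metrics, topology facts, and alerts together.
    """
    q = embed(query)
    scored = [(cosine(q, embed(text)), modality, text) for modality, text in CORPUS]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [(m, t) for score, m, t in scored[:k] if score > 0]


def build_prompt(query):
    """Assemble retrieved evidence and the question into one LLM prompt string."""
    lines = [f"[{m}] {t}" for m, t in retrieve(query)]
    return "Evidence:\n" + "\n".join(lines) + f"\n\nQuestion: {query}"


if __name__ == "__main__":
    print(build_prompt("why is checkout-service timeout connection"))
```

The single shared index is the point of the sketch: a question about one service pulls in evidence from several modalities at once, which is then handed to the language model as grounding. A real system would replace `embed` with a trained encoder and `CORPUS` with a live telemetry store.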

📝 Abstract

Today's cloud-hosted applications and services are complex systems, and a performance or functional instability can have dozens or hundreds of potential root causes. Our hypothesis is that by combining the pattern matching capabilities of modern AI tools with a natural multi-modal RAG LLM interface, problem identification and resolution can be simplified. ARCA is a new multi-modal RAG LLM system that targets this domain. Step-wise evaluations show that ARCA outperforms state-of-the-art alternatives.

Problem

Research questions and friction points this paper is trying to address.

Efficiently diagnosing the root causes of cloud platform instability

Simplifying problem identification with multi-modal RAG LLMs

Improving resolution accuracy compared to existing methods

Innovation

Methods, ideas, or system contributions that make the work stand out.

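ARCA, a multi-modal RAG LLM system for diagnosing cloud instability

Combines the pattern-matching capabilities of modern AI tools with a natural multi-modal RAG LLM interface

Step-wise evaluations show ARCA outperforming state-of-the-art alternatives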
