AidAI: Automated Incident Diagnosis for AI Workloads in the Cloud

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
AI cloud workloads are prone to failures due to high hardware utilization and prolonged training cycles; however, existing provider-centric failure management paradigms suffer from a knowledge gap between customers and providers, resulting in average troubleshooting durations of several days. To address this, the authors propose AidAI, a customer-facing framework for real-time autonomous incident diagnosis. It constructs a structured expert knowledge base offline from historical on-call records and, online, integrates rule-guided causal reasoning with a lightweight classification model to emulate human SREs' iterative hypothesis-testing process, thereby shifting diagnostic authority to the customer. Evaluated on real-world Microsoft production workloads, AidAI achieves a Micro F1 score of 0.854 and a Macro F1 score of 0.816 with negligible latency and resource overhead, and significantly reduces mean time to resolution.

📝 Abstract
AI workloads experience frequent incidents due to intensive hardware utilization and extended training times. The current incident management workflow is provider-centric, where customers report incidents and place the entire troubleshooting responsibility on the infrastructure provider. However, the inherent knowledge gap between customer and provider significantly impacts incident resolution efficiency. In AI infrastructure, incidents may take several days on average to mitigate, resulting in delays and productivity losses. To address these issues, we present AidAI, a customer-centric system that provides immediate incident diagnosis for customers and streamlines the creation of incident tickets for unresolved issues. The key idea of AidAI is to construct internal knowledge bases from historical on-call experiences during the offline phase and mimic the reasoning process of human experts to diagnose incidents through trial and error in the online phase. Evaluations using real-world incident records in Microsoft show that AidAI achieves an average Micro F1 score of 0.854 and Macro F1 score of 0.816 without significant overhead.
Problem

Research questions and friction points this paper is trying to address.

Frequent AI workload incidents due to hardware and training demands
Inefficient provider-centric incident resolution causing delays and productivity loss
Knowledge gap between customers and providers slows troubleshooting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Customer-centric automated incident diagnosis system
Internal knowledge bases from historical on-call experiences
Mimics human expert reasoning for incident diagnosis
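The iterative, trial-and-error diagnosis loop described above can be sketched in miniature. Everything here is illustrative: the `Hypothesis` structure, `diagnose` function, and the toy rules are assumptions standing in for AidAI's actual knowledge-base format and reasoning pipeline, which the paper does not specify at this level of detail.

```python
# Hypothetical sketch of AidAI-style online diagnosis: rule-guided,
# iterative hypothesis testing over an expert knowledge base distilled
# from historical on-call records. Names and rules are illustrative,
# not the paper's actual API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Hypothesis:
    cause: str
    prior: float                   # likelihood estimated from past incidents
    check: Callable[[dict], bool]  # diagnostic probe against observed telemetry

def diagnose(telemetry: dict, hypotheses: list[Hypothesis]) -> str:
    # Mimic an SRE's trial-and-error loop: test the most likely cause
    # first, discard it if the probe fails, and move to the next.
    for h in sorted(hypotheses, key=lambda h: h.prior, reverse=True):
        if h.check(telemetry):
            return h.cause
    # Unresolved issues fall through to streamlined ticket creation.
    return "unresolved: file incident ticket"

# Toy knowledge base (hypothetical causes and probes).
kb = [
    Hypothesis("GPU ECC memory error", 0.6,
               lambda t: t.get("ecc_errors", 0) > 0),
    Hypothesis("NCCL communication timeout", 0.3,
               lambda t: t.get("nccl_timeout", False)),
]

print(diagnose({"nccl_timeout": True}, kb))  # NCCL communication timeout
```

In the paper's full design, a lightweight classification model backs up the rules when no probe fires; here that fallback is reduced to the ticket-creation path.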