ActionNex: A Virtual Outage Manager for Cloud

📅 2026-04-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses fault management in large-scale cloud operations, which relies heavily on human expertise and is hampered by incomplete information and complex cross-team coordination. It proposes an end-to-end agentic system built around a hierarchical memory: long-term Key-Condition-Action (KCA) knowledge, episodic memory of prior outages, and working memory of the live context. The system compresses multimodal operational signals into critical events and produces context-aware, real-time recommendations by matching those events against KCA preconditions. Actions that operators actually execute serve as implicit feedback, allowing the system to keep evolving alongside its human users. Evaluated on eight real Azure outages, the system achieves 71.4% precision and 52.8% to 54.8% recall. It has been piloted in production and has received positive early feedback.
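To make the reported metrics concrete, the sketch below shows one way precision and recall could be computed for a set of recommended actions against a ground-truth action set. It assumes actions are compared by exact identifier match; the paper's actual matching procedure is not described here, and every action name and number in the example is hypothetical rather than taken from the evaluation.

```python
def precision_recall(recommended, ground_truth):
    """Return (precision, recall) for two collections of action identifiers.

    Assumption: an action counts as correct only if its identifier appears
    verbatim in the ground-truth set. Real systems may use fuzzier matching.
    """
    recommended = set(recommended)
    ground_truth = set(ground_truth)
    hits = recommended & ground_truth
    precision = len(hits) / len(recommended) if recommended else 0.0
    recall = len(hits) / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical recommendations produced during one outage.
recommended_actions = {
    "engage_networking_team",
    "post_customer_update",
    "rollback_deployment",
    "restart_frontend",
}

# Hypothetical ground-truth set, e.g. actions operators actually executed.
executed_actions = {
    "engage_networking_team",
    "post_customer_update",
    "rollback_deployment",
    "declare_mitigation",
    "failover_region",
}

precision, recall = precision_recall(recommended_actions, executed_actions)
```

Scoring the same recommendations against a second ground-truth set leaves the recommendation list unchanged but can shift recall, which is one plausible reason a result is reported as a range.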
📝 Abstract
Outage management in large-scale cloud operations remains heavily manual, requiring rapid triage, cross-team coordination, and experience-driven decisions under partial observability. We present ActionNex, a production-grade agentic system that supports end-to-end outage assistance, including real-time updates, knowledge distillation, and role- and stage-conditioned next-best action recommendations. ActionNex ingests multimodal operational signals (e.g., outage content, telemetry, and human communications) and compresses them into critical events that represent meaningful state transitions. It couples this perception layer with a hierarchical memory subsystem: long-term Key-Condition-Action (KCA) knowledge distilled from playbooks and historical executions, episodic memory of prior outages, and working memory of the live context. A reasoning agent aligns current critical events to preconditions, retrieves relevant memories, and generates actionable recommendations; executed human actions serve as an implicit feedback signal to enable continual self-evolution in a human-agent hybrid system. We evaluate ActionNex on eight real Azure outages (8M tokens, 4,000 critical events) using two complementary ground-truth action sets, achieving 71.4% precision and 52.8-54.8% recall. The system has been piloted in production and has received positive early feedback.
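The step in which critical events are aligned to preconditions and filtered by role and stage can be illustrated with a minimal sketch. This is not ActionNex's implementation: the `KCARule` structure, the treatment of "key" as a simple lookup label, the subset-based matching, the specificity ranking, and all event, role, and action names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KCARule:
    """Hypothetical Key-Condition-Action entry in long-term memory."""
    key: str                # assumed: a label identifying the rule
    conditions: frozenset   # critical events that must all be observed
    action: str             # recommended next action
    role: str               # role the recommendation is intended for
    stage: str              # outage stage in which the rule applies


def recommend(rules, observed_events, role, stage):
    """Return actions whose preconditions are all satisfied.

    A rule fires when it targets the given role and stage and every one of
    its conditions appears among the observed critical events. Rules with
    more conditions are ranked first, as a simple stand-in for specificity.
    """
    observed = set(observed_events)
    matches = [
        rule for rule in rules
        if rule.role == role
        and rule.stage == stage
        and rule.conditions <= observed
    ]
    matches.sort(key=lambda rule: len(rule.conditions), reverse=True)
    return [rule.action for rule in matches]


# Hypothetical long-term knowledge.
rules = [
    KCARule("dns-escalate", frozenset({"dns_errors_spike"}),
            "engage_networking_team", "incident_manager", "triage"),
    KCARule("dns-rollback",
            frozenset({"dns_errors_spike", "recent_deployment"}),
            "rollback_deployment", "incident_manager", "triage"),
    KCARule("comms-update", frozenset({"customer_impact_confirmed"}),
            "post_customer_update", "comms_manager", "mitigation"),
]

# Hypothetical working memory: critical events seen so far in the live outage.
working_memory = ["dns_errors_spike", "recent_deployment"]

actions = recommend(rules, working_memory, "incident_manager", "triage")
```

Here `actions` lists the rollback before the escalation because the rollback rule matches on two observed events rather than one. Episodic-memory retrieval and the feedback loop from executed human actions are left out of this sketch.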
Problem

Research questions and friction points this paper is trying to address.

cloud outage management
partial observability
manual triage
cross-team coordination
experience-driven decisions
Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic system
multimodal signal fusion
hierarchical memory
next-best action recommendation
human-agent hybrid learning