ActionNex: A Virtual Outage Manager for Cloud

📅 2026-04-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses outage management in large-scale cloud operations, which relies heavily on human expertise and suffers from incomplete information and complex cross-team coordination. To address these limitations, the work proposes ActionNex, an end-to-end agentic system built around a hierarchical memory: long-term Key-Condition-Action (KCA) knowledge distilled from playbooks and historical executions, episodic memory of prior outages, and working memory of the live context. The system compresses multimodal operational signals into critical events and delivers role- and stage-conditioned next-best-action recommendations in real time by aligning those events to KCA preconditions. Executed human actions serve as an implicit feedback signal, so the system continues to evolve alongside its operators. Evaluated on eight real Azure outages against two complementary ground-truth action sets, the system achieves 71.4% precision and 52.8% to 54.8% recall. It has been piloted in production and has received positive early feedback.
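As a rough illustration of how the reported metrics can be read, the sketch below computes precision and recall for a set of recommended actions against a ground-truth action set. It assumes exact matching on action labels, which the source does not specify; the function name and the toy action labels are hypothetical and are not taken from the paper's data.

```python
def precision_recall(recommended, ground_truth):
    """Set-based precision and recall over action labels.

    Assumption: an action counts as correct only if its label appears
    verbatim in the ground-truth set. The paper may use a looser match.
    """
    recommended = set(recommended)
    ground_truth = set(ground_truth)
    hits = recommended & ground_truth
    precision = len(hits) / len(recommended) if recommended else 0.0
    recall = len(hits) / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Toy example with made-up labels (not results from the paper).
recommended = ["engage_network_team", "failover_region", "post_update", "rollback"]
ground_truth = ["engage_network_team", "failover_region", "declare_mitigated"]
precision, recall = precision_recall(recommended, ground_truth)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Scoring the same recommendations against two different ground-truth sets changes the overlap and the recall denominator, which is one way a range of recall values can arise.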

📝 Abstract

Outage management in large-scale cloud operations remains heavily manual, requiring rapid triage, cross-team coordination, and experience-driven decisions under partial observability. We present ActionNex, a production-grade agentic system that supports end-to-end outage assistance, including real-time updates, knowledge distillation, and role- and stage-conditioned next-best action recommendations. ActionNex ingests multimodal operational signals (e.g., outage content, telemetry, and human communications) and compresses them into critical events that represent meaningful state transitions. It couples this perception layer with a hierarchical memory subsystem: long-term Key-Condition-Action (KCA) knowledge distilled from playbooks and historical executions, episodic memory of prior outages, and working memory of the live context. A reasoning agent aligns current critical events to preconditions, retrieves relevant memories, and generates actionable recommendations; executed human actions serve as an implicit feedback signal to enable continual self-evolution in a human-agent hybrid system. We evaluate ActionNex on eight real Azure outages (8M tokens, 4,000 critical events) using two complementary ground-truth action sets, achieving 71.4% precision and 52.8% to 54.8% recall. The system has been piloted in production and has received positive early feedback.
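The alignment of critical events to KCA preconditions can be pictured with a minimal sketch. Everything here is an assumption made for illustration: the `KCAEntry` fields, the subset-based matching rule, the ranking by number of conditions, and the example event and action names are hypothetical, and the abstract does not describe the actual data model or matching logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KCAEntry:
    """Hypothetical Key-Condition-Action record (field names are assumed)."""
    key: str                 # short identifier for the piece of knowledge
    conditions: frozenset    # critical events that must all have been observed
    action: str              # recommended next action
    role: str                # role the recommendation is addressed to
    stage: str               # outage stage in which it applies


def recommend(entries, observed_events, role, stage):
    """Return actions whose preconditions are all present in the live context.

    Assumed rule: an entry fires when its role and stage match and its
    conditions are a subset of the observed critical events. More specific
    entries (more conditions) are listed first.
    """
    observed = set(observed_events)
    matches = [
        e for e in entries
        if e.role == role and e.stage == stage and e.conditions <= observed
    ]
    matches.sort(key=lambda e: len(e.conditions), reverse=True)
    return [e.action for e in matches]


# Illustrative long-term memory with made-up entries.
memory = [
    KCAEntry("dns-1", frozenset({"dns_errors_spike"}),
             "engage_dns_team", "incident_manager", "triage"),
    KCAEntry("dns-2", frozenset({"dns_errors_spike", "recent_config_change"}),
             "request_config_rollback", "incident_manager", "triage"),
    KCAEntry("comms-1", frozenset({"customer_impact_confirmed"}),
             "publish_status_update", "comms_manager", "mitigation"),
]

working_memory = ["dns_errors_spike", "recent_config_change"]
print(recommend(memory, working_memory, "incident_manager", "triage"))
# prints ['request_config_rollback', 'engage_dns_team']
```

A production system would presumably match events semantically and fold in episodic memory as well; exact set containment is used here only to keep the idea of role- and stage-conditioned precondition matching easy to follow.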

Problem

Research questions and friction points this paper is trying to address.

cloud outage management

partial observability

manual triage

cross-team coordination

experience-driven decisions

Innovation

Methods, ideas, or system contributions that make the work stand out.

agentic system

multimodal signal fusion

hierarchical memory