AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses three deployment challenges that large language models face in enterprise cloud diagnostics: data privacy concerns, permission constraints, and the inability to learn from failures. The proposed AOI is a trainable multi-agent framework that formulates automated operations as a structured trajectory learning problem under safety constraints, featuring a read-write separated execution architecture and Group Relative Policy Optimization (GRPO) to enable local training without exposing sensitive data. A closed-loop failure-trajectory evolution mechanism further repurposes errors as supervisory signals. On the AIOpsLab benchmark, AOI improves the best@5 success rate by 24.4 percentage points to 66.3%; a locally deployed 14B-parameter model achieves a 42.9% avg@1 success rate on unseen fault tasks, surpassing Claude Sonnet 4.5; and leveraging failure trajectories boosts end-to-end avg@5 performance by 4.8 points while reducing variance by 35%.
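As a rough illustration of the GRPO training signal mentioned above: GRPO replaces a learned value baseline with a baseline computed from a group of sampled trajectories for the same prompt. The sketch below shows only that group-relative advantage computation, not the paper's full training pipeline; function and variable names are illustrative.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: each sampled trajectory's reward is
    normalized against the mean and standard deviation of its own group,
    so no separate value network is required."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Two successful and two failed rollouts in one group: successes get
# positive advantage, failures negative.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

In a full GRPO update these advantages would weight a clipped policy-gradient loss, as in PPO, over the tokens of each sampled trajectory.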

📝 Abstract
Large language model (LLM) agents offer a promising data-driven approach to automating Site Reliability Engineering (SRE), yet their enterprise deployment is constrained by three challenges: restricted access to proprietary data, unsafe action execution under permission-governed environments, and the inability of closed systems to improve from failures. We present AOI (Autonomous Operations Intelligence), a trainable multi-agent framework formulating automated operations as a structured trajectory learning problem under security constraints. Our approach integrates three key components. First, a trainable diagnostic system applies Group Relative Policy Optimization (GRPO) to distill expert-level knowledge into locally deployed open-source models, enabling preference-based learning without exposing sensitive data. Second, a read-write separated execution architecture decomposes operational trajectories into observation, reasoning, and action phases, allowing safe learning while preventing unauthorized state mutation. Third, a Failure Trajectory Closed-Loop Evolver mines unsuccessful trajectories and converts them into corrective supervision signals, enabling continual data augmentation. Evaluated on the AIOpsLab benchmark, our contributions yield cumulative gains. (1) The AOI runtime alone achieves 66.3% best@5 success on all 86 tasks, outperforming the prior state-of-the-art (41.9%) by 24.4 points. (2) Adding Observer GRPO training, a locally deployed 14B model reaches 42.9% avg@1 on 63 held-out tasks with unseen fault types, surpassing Claude Sonnet 4.5. (3) The Evolver converts 37 failed trajectories into diagnostic guidance, improving end-to-end avg@5 by 4.8 points while reducing variance by 35%.
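The read-write separated execution architecture described in the abstract can be pictured as a permission gate that lets the agent observe cluster state freely while blocking any command that could mutate it. The sketch below is a minimal illustration of that idea, not the paper's implementation; the command prefixes are assumptions chosen for a Kubernetes setting.

```python
# Observation-only command prefixes the diagnostic agent may run directly.
READ_ONLY_PREFIXES = (
    "kubectl get",
    "kubectl describe",
    "kubectl logs",
    "kubectl top",
)

def is_safe(command: str) -> bool:
    """Permit only read-side (observation) commands; anything else is
    treated as a potential state mutation and rejected, so unauthorized
    writes cannot occur during trajectory collection."""
    return command.strip().startswith(READ_ONLY_PREFIXES)

print(is_safe("kubectl get pods -n default"))   # observation: allowed
print(is_safe("kubectl delete pod web-0"))       # mutation: blocked
```

A production gate would also handle flags that smuggle writes through read verbs, but prefix allow-listing conveys the core separation of observation from action.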
Problem

Research questions and friction points this paper is trying to address.

autonomous cloud diagnosis
failure learning
secure LLM agents
Site Reliability Engineering
proprietary data constraints
Innovation

Methods, ideas, or system contributions that make the work stand out.

Group Relative Policy Optimization
read-write separated execution
failure trajectory learning
autonomous cloud diagnosis
preference-based learning