🤖 AI Summary
Dynamic microservice orchestration in cloud-native environments produces an explosion of operational data, which leads to information overload, inefficient multi-task coordination, and loss of contextual continuity during fault diagnosis. To address these challenges, the authors propose AOI, a multi-agent collaborative framework with a three-layer memory architecture (working, episodic, and semantic memory), an LLM-driven, context-aware compression mechanism, and a dynamic task scheduler that prioritizes operations according to real-time system state. This design is intended to preserve contextual continuity and support adaptive decision-making. In the reported experiments, AOI achieves a 72.4% context compression ratio while retaining 92.8% of critical information, reaches a 94.2% task success rate, and reduces mean time to repair (MTTR) by 34.4% compared to the best baseline. The authors present AOI as a scalable, interpretable collaborative approach to intelligent cloud-native operations.
📝 Abstract
The proliferation of cloud-native architectures, characterized by microservices and dynamic orchestration, has rendered modern IT infrastructures exceedingly complex and volatile. This complexity generates overwhelming volumes of operational data, leading to critical bottlenecks in conventional systems: inefficient information processing, poor task coordination, and loss of contextual continuity during fault diagnosis and remediation. To address these challenges, we propose AOI (AI-Oriented Operations), a novel multi-agent collaborative framework that integrates three specialized agents with an LLM-based Context Compressor. Its core innovations include: (1) a dynamic task scheduling strategy that adaptively prioritizes operations based on real-time system states, and (2) a three-layer memory architecture comprising Working, Episodic, and Semantic layers that optimizes context retention and retrieval. Extensive experiments on both synthetic and real-world benchmarks demonstrate that AOI effectively mitigates information overload, achieving a 72.4% context compression ratio while preserving 92.8% of critical information. AOI also significantly enhances operational efficiency, attaining a 94.2% task success rate and reducing the Mean Time to Repair (MTTR) by 34.4% compared to the best baseline. This work presents a paradigm shift towards scalable, adaptive, and context-aware autonomous operations, enabling robust management of next-generation IT infrastructures with minimal human intervention.
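The dynamic task scheduling idea in innovation (1) can be illustrated with a minimal sketch. It is not AOI's actual scheduler: the scoring rule, the task names, and the per-service `error_rate` state are all hypothetical, chosen only to show how the same task list can be reordered when the observed system state changes.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    # heapq is a min-heap, so the negated score is stored as the sort key.
    sort_key: float
    name: str = field(compare=False)


def score(base_urgency: float, error_rate: float) -> float:
    """Hypothetical scoring rule: urgency scaled by the target service's error rate."""
    return base_urgency * (1.0 + error_rate)


def schedule(tasks, error_rates):
    """Return task names ordered from highest to lowest state-adjusted priority.

    tasks: list of (task_name, target_service, base_urgency) tuples (assumed shape).
    error_rates: mapping of service name to its current error rate (assumed state).
    """
    heap = []
    for name, service, urgency in tasks:
        heapq.heappush(heap, Task(-score(urgency, error_rates[service]), name))
    return [heapq.heappop(heap).name for _ in range(len(heap))]


tasks = [
    ("restart_cache", "cache", 1.0),
    ("collect_logs", "api", 1.0),
    ("scale_db", "db", 2.0),
]

# Same tasks, two different system states, two different orderings.
order_api_degraded = schedule(tasks, {"cache": 0.0, "api": 0.5, "db": 0.1})
order_cache_degraded = schedule(tasks, {"cache": 0.9, "api": 0.0, "db": 0.0})
```

With the API degraded, `collect_logs` outranks `restart_cache`; once the cache becomes the failing service, the two swap places while `scale_db` stays first because of its higher base urgency.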