A Survey on AgentOps: Categorization, Challenges, and Future Directions

📅 2025-08-04
🤖 AI Summary
To address the frequent runtime anomalies in large language model (LLM)-based agent systems and the lack of systematic operational methodologies, this paper presents a comprehensive survey on AgentOps, the operations of agent systems. Through a systematic literature review, it defines anomalies in agent systems and categorizes them into intra-agent anomalies and inter-agent anomalies. Building on this taxonomy, it introduces an AgentOps framework with four stages: monitoring, anomaly detection, root cause analysis, and resolution. The work aims to give agent operations a structured conceptual foundation, identifies key open challenges, and offers a basis for further academic research and industrial deployment.

📝 Abstract
As the reasoning capabilities of Large Language Models (LLMs) continue to advance, LLM-based agent systems offer advantages in flexibility and interpretability over traditional systems, garnering increasing attention. However, despite the widespread research interest and industrial application of agent systems, these systems, like their traditional counterparts, frequently encounter anomalies. These anomalies lead to instability and insecurity, hindering their further development. Therefore, a comprehensive and systematic approach to the operation and maintenance of agent systems is urgently needed. Unfortunately, current research on the operations of agent systems is sparse. To address this gap, we have undertaken a survey on agent system operations with the aim of establishing a clear framework for the field, defining the challenges, and facilitating further development. Specifically, this paper begins by systematically defining anomalies within agent systems, categorizing them into intra-agent anomalies and inter-agent anomalies. Next, we introduce a novel and comprehensive operational framework for agent systems, dubbed Agent System Operations (AgentOps). We provide detailed definitions and explanations of its four key stages: monitoring, anomaly detection, root cause analysis, and resolution.
Problem

Research questions and friction points this paper is trying to address.

Addressing anomalies in LLM-based agent systems for stability
Establishing a framework for agent system operations (AgentOps)
Categorizing and resolving intra-agent and inter-agent anomalies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Defines intra-agent and inter-agent anomalies
Introduces AgentOps operational framework
Details four key stages of AgentOps
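
The four stages named above (monitoring, anomaly detection, root cause analysis, resolution) can be pictured as a simple pipeline. The following is a minimal sketch only, not the paper's implementation: the `Event` fields, the rule that a failed step involving a peer agent is inter-agent, and the action names `retry_step` and `renegotiate_with_peer` are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Scope labels named after the survey's two anomaly categories.
INTRA_AGENT = "intra-agent"
INTER_AGENT = "inter-agent"


@dataclass
class Event:
    """One monitored record; every field here is an illustrative assumption."""
    agent_id: str
    kind: str                      # e.g. "tool_call" or "message"
    ok: bool                       # whether the step succeeded
    peer_id: Optional[str] = None  # set when another agent is involved


def monitor(raw_records):
    """Stage 1 (monitoring): turn raw dict records into Event objects."""
    return [Event(**record) for record in raw_records]


def detect(events):
    """Stage 2 (anomaly detection): in this toy version, flag failed steps."""
    return [event for event in events if not event.ok]


def root_cause(event):
    """Stage 3 (root cause analysis): toy rule that assigns a scope."""
    return INTER_AGENT if event.peer_id is not None else INTRA_AGENT


def resolve(event, scope):
    """Stage 4 (resolution): choose a placeholder action for the scope."""
    action = "retry_step" if scope == INTRA_AGENT else "renegotiate_with_peer"
    return {"agent": event.agent_id, "scope": scope, "action": action}


def run_agentops(raw_records):
    """Chain the four stages in order and return one plan per anomaly."""
    anomalies = detect(monitor(raw_records))
    return [resolve(event, root_cause(event)) for event in anomalies]


if __name__ == "__main__":
    records = [
        {"agent_id": "a1", "kind": "tool_call", "ok": True},
        {"agent_id": "a1", "kind": "tool_call", "ok": False},
        {"agent_id": "a2", "kind": "message", "ok": False, "peer_id": "a1"},
    ]
    for plan in run_agentops(records):
        print(plan)
```

Keeping each stage as a separate function mirrors the staged structure described in the abstract; a real system would likely replace the toy rules with detection and diagnosis methods of the kind the survey reviews.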
Authors
Zexin Wang
Computer Network Information Center, Chinese Academy of Sciences, China
Jingjing Li
Computer Network Information Center, Chinese Academy of Sciences, China
Quan Zhou
Computer Network Information Center, Chinese Academy of Sciences, China
Haotian Si
Computer Network Information Center, Chinese Academy of Sciences, China
Yuanhao Liu
Institute of Computing Technology, Chinese Academy of Sciences
trustworthy AI, fairness of algorithms
Jianhui Li
Computer Network Information Center, Chinese Academy of Sciences, China; also with Nanjing University
Gaogang Xie
Computer Network Information Center, Chinese Academy of Sciences, China
Fei Sun
Institute of Computing Technology, Chinese Academy of Sciences, China
Dan Pei
Associate Professor of Computer Science, Tsinghua University
AIOps, Time Series Intelligence
Changhua Pei
Computer Network Information Center, Chinese Academy of Sciences, China