Large Language Models for Agentic NetOps and AIOps: Architectures, Evaluation, and Safety

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the safety and reliability challenges of letting large language models make autonomous decisions in NetOps and AIOps. It frames agent-based operations around constrained autonomy: enforceable contracts set clear boundaries on what an agent may observe, propose, and execute, and evidence collection, policy adherence, access control, and rollback are integrated into a closed-loop workflow spanning diagnosis, root-cause analysis, configuration generation, and limited self-healing. Safety and controllability across the workflow are pursued through sandbox replay, canary testing, constrained tool invocation, and auditable traces. By setting out design principles for reliable, auditable, and secure deployment of intelligent operations, the study moves evaluation criteria from static question answering toward end-to-end robustness and safety verification.
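The canary-and-rollback idea mentioned in the summary can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `canary_apply`, the dictionary-based device state, and the health predicate are all hypothetical.

```python
from typing import Callable, Dict, List


def canary_apply(
    state: Dict[str, int],
    change: Dict[str, int],
    healthy: Callable[[Dict[str, int]], bool],
    audit: List[str],
) -> bool:
    """Apply `change` to `state`; restore the snapshot if the health check fails.

    All names here are hypothetical; a real system would act on devices,
    not on an in-memory dictionary.
    """
    snapshot = dict(state)            # copy kept so the change can be undone
    state.update(change)
    audit.append(f"applied {change}")
    if healthy(state):
        audit.append("health check passed; change kept")
        return True
    state.clear()                     # roll back to the pre-change snapshot
    state.update(snapshot)
    audit.append("health check failed; rolled back")
    return False
```

Every step appends to an audit list, so the trace of what was tried and why it was reverted survives the rollback.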

📝 Abstract

Large language models are increasingly being used to support network operations (NetOps) and artificial intelligence for IT operations (AIOps), including incident investigation, root-cause analysis, configuration synthesis, and limited self-healing. In both NetOps and AIOps, this shift is changing how tasks are managed. Agent-based operations run as workflows that move from gathering evidence to taking action, subject to permissions, policies, and checks, with rollback available when necessary. This matters because operational decisions can take effect immediately. To make the argument concrete, we organise the relevant literature around the hierarchy of autonomy, tool scope, evidence traces, and assurance contracts. These contracts define what an agent may observe, propose, and execute. They also define the checks that must pass before any action is allowed. A consistent pattern appears across work on telemetry query recommendation, diagnosis, root-cause analysis, configuration synthesis, change planning, and limited self-healing: operational reliability does not come chiefly from the model itself but from the machinery around it. We also argue that evaluation should go beyond static question answering. Agentic NetOps and AIOps systems require workflow-centred evaluation, including trace quality, bounded tool use, safe proposal generation, replay in sandboxed environments, and canary trials with rollback-aware scoring. Without these measures, a system may appear robust yet remain fragile. Finally, we examine security, privacy, and governance risks that become acute when agents sit close to operational control surfaces. Taken together, the survey concludes that progress in intelligent NetOps and AIOps will depend on treating autonomy as a constrained operational control problem, whose outputs must be reliable, auditable, and securely deployable.
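As a rough illustration of the assurance contracts described in the abstract, the following Python sketch models the observe, propose, and execute scopes plus the checks that gate execution. The survey does not give a concrete schema, so the class `AssuranceContract`, its fields, the action names, and the `in_lab_scope` check are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Tuple

# Hypothetical type: a check inspects a proposed action and returns pass/fail.
Check = Callable[[Dict[str, str]], bool]


@dataclass(frozen=True)
class AssuranceContract:
    observe: FrozenSet[str]          # telemetry sources the agent may read
    propose: FrozenSet[str]          # action kinds the agent may suggest
    execute: FrozenSet[str]          # action kinds the agent may run itself
    checks: Tuple[Check, ...] = ()   # all must pass before execution

    def decide(self, action: Dict[str, str]) -> str:
        kind = action["kind"]
        if kind in self.execute and all(check(action) for check in self.checks):
            return "execute"
        if kind in self.propose or kind in self.execute:
            return "propose"         # fall back to a human-reviewed proposal
        return "deny"


def in_lab_scope(action: Dict[str, str]) -> bool:
    # Hypothetical policy: autonomous execution only on lab devices.
    return action.get("target", "").startswith("lab-")


contract = AssuranceContract(
    observe=frozenset({"syslog", "bgp_state"}),
    propose=frozenset({"restart_bgp_session", "change_acl"}),
    execute=frozenset({"restart_bgp_session"}),
    checks=(in_lab_scope,),
)
```

With this contract, restarting a BGP session on a lab device would be executed, the same action on a production device would be downgraded to a proposal, and an action kind outside both scopes would be denied.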

Problem

Research questions and friction points this paper is trying to address.

Agentic NetOps

AIOps

Large Language Models

Operational Safety

Autonomy Constraints

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic NetOps

Assurance Contracts

Workflow-centered Evaluation