Collaborative AI Agents and Critics for Fault Detection and Cause Analysis in Network Telemetry

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of multimodal collaborative intelligence for fault detection, severity assessment, and root cause analysis in network telemetry by proposing a communication-efficient federated multi-agent framework that preserves the privacy of local cost functions. The framework employs multiple actors (task-performing agents) and critics (feedback-providing evaluators), coordinated through a central server without requiring direct communication among agents or critics. This design ensures that communication overhead scales linearly with the number of modalities and remains independent of the number of agents. By integrating multi-timescale stochastic approximation, classical machine learning, and generative AI foundation models, the system achieves privacy-preserving collaborative optimization. Empirical validation on network telemetry tasks demonstrates its effectiveness, with theoretical guarantees on the convergence of time-averaged active states.
📝 Abstract
We develop algorithms for collaborative control of AI agents and critics in a multi-actor, multi-critic federated multi-agent system. Each AI agent and critic has access to classical machine learning or generative AI foundation models. The AI agents and critics collaborate with a central server to complete multimodal tasks such as fault detection, severity assessment, and cause analysis in a network telemetry system, text-to-image generation, video generation, and healthcare diagnostics from medical images and patient records. The AI agents complete their tasks and send their responses to the AI critics for evaluation. The critics then send feedback to the agents so that the agents can improve their responses. Collaboratively, they minimize the overall cost to the system with no inter-agent or inter-critic communication. AI agents and critics keep their cost functions or derivatives of cost functions private. Using multi-timescale stochastic approximation techniques, we provide convergence guarantees on the time-averaged active states of AI agents and critics. The communication overhead on the system is small: it is of order $\mathcal{O}(m)$ for $m$ modalities and is independent of the number of AI agents and critics. Finally, we present an example of fault detection, severity assessment, and cause analysis in network telemetry, together with a thorough evaluation of the algorithm's efficacy.
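The communication pattern described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function `simulate`, the quadratic private costs, the activation rule, the `capacity` target, and the server's step size are all hypothetical choices made for illustration, and the sketch does not reproduce the paper's multi-timescale step-size schedule. It only shows that a server can broadcast one scalar per modality while each agent keeps its cost derivative local and tracks its own time-averaged active state.

```python
import random

def simulate(num_agents, num_modalities, steps=200, seed=0):
    """Toy server-coordinated loop (illustrative assumptions throughout)."""
    rng = random.Random(seed)
    # Assumed private costs 0.5 * a * x^2; each agent keeps its own coefficients.
    coeffs = [[rng.uniform(0.5, 2.0) for _ in range(num_modalities)]
              for _ in range(num_agents)]
    # Time-averaged active state of every agent for every modality.
    avg = [[0.0] * num_modalities for _ in range(num_agents)]
    signal = [1.0] * num_modalities  # server broadcast: one scalar per modality
    capacity = num_agents / 2        # assumed target count of active agents
    scalars_broadcast = 0
    for t in range(1, steps + 1):
        # Broadcast payload per round is m scalars, whatever num_agents is.
        scalars_broadcast += len(signal)
        active_total = [0] * num_modalities
        for i in range(num_agents):
            for j in range(num_modalities):
                # Private derivative of the assumed cost at the running average.
                grad = coeffs[i][j] * avg[i][j]
                p = min(1.0, signal[j] / (1.0 + grad))
                active = 1 if rng.random() < p else 0
                # Running mean of the binary active state.
                avg[i][j] += (active - avg[i][j]) / t
                active_total[j] += active
        for j in range(num_modalities):
            # Server sees only an aggregate count per modality (an assumption).
            step = 1.0 / (t + 10)
            error = (capacity - active_total[j]) / num_agents
            signal[j] = max(0.0, signal[j] + step * error)
    return scalars_broadcast, avg
```

Calling `simulate(5, 3, steps=50)` and `simulate(50, 3, steps=50)` yields the same broadcast count, since the per-round payload depends only on the number of modalities. No agent ever sends its coefficients or derivatives to the server or to another agent.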
Problem

Research questions and friction points this paper is trying to address.

fault detection
cause analysis
network telemetry
severity assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

collaborative AI agents
multi-critic federated system
multi-time scale stochastic approximation
privacy-preserving cost optimization
network telemetry fault analysis
Syed Eqbal Alam
Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Alberta, Canada; SheQAI Research, Edmonton, Alberta, Canada
Zhan Shu
Professor, University of Alberta
control, control engineering, control theory