LumiMAS: A Comprehensive Framework for Real-Time Monitoring and Enhanced Observability in Multi-Agent Systems

📅 2025-08-17
🤖 AI Summary
Problem: System-level failures in large language model (LLM)-powered multi-agent systems (MAS) are difficult to monitor and attribute because agent interactions are distributed and collaborative. Method: This paper proposes LumiMAS, an end-to-end, real-time observability framework tailored for LLM-MAS collaboration. Departing from conventional single-agent monitoring paradigms, it treats the entire MAS workflow as one coherent observational space and implements a three-tier architecture: log collection, real-time anomaly detection, and taxonomy-guided root-cause analysis. This design enables joint anomaly detection across agents and root-cause localization at the system level. Results: Evaluated on seven representative LLM-MAS applications, the framework proves effective at failure detection, failure classification, and root-cause analysis.

📝 Abstract
The incorporation of large language models in multi-agent systems (MASs) has the potential to significantly improve our ability to autonomously solve complex problems. However, such systems introduce unique challenges in monitoring, interpreting, and detecting system failures. Most existing MAS observability frameworks focus on analyzing each individual agent separately, overlooking failures associated with the entire MAS. To bridge this gap, we propose LumiMAS, a novel MAS observability framework that incorporates advanced analytics and monitoring techniques. The proposed framework consists of three key components: a monitoring and logging layer, an anomaly detection layer, and an anomaly explanation layer. LumiMAS's first layer monitors MAS executions, creating detailed logs of the agents' activity. These logs serve as input to the anomaly detection layer, which detects anomalies across the MAS workflow in real time. Then, the anomaly explanation layer performs classification and root cause analysis (RCA) of the detected anomalies. LumiMAS was evaluated on seven different MAS applications, implemented using two popular MAS platforms, and against a diverse set of possible failures. The applications include two novel failure-tailored applications that illustrate the effects of a hallucination or bias on the MAS. The evaluation results demonstrate LumiMAS's effectiveness in failure detection, classification, and RCA.
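The three-layer flow described in the abstract can be illustrated with a minimal Python sketch. This is not LumiMAS's implementation: the event schema (`AgentEvent`), the class and function names, the z-score detector over total token usage, and the "largest deviation" root-cause heuristic are all assumptions chosen only to show how logs might feed detection and how detection might feed explanation.

```python
from dataclasses import dataclass, field
from statistics import mean, stdev


@dataclass
class AgentEvent:
    """One logged step of one agent (hypothetical schema, not from the paper)."""
    run_id: str
    agent: str
    latency_s: float
    tokens: int


@dataclass
class MonitoringLayer:
    """Layer 1 (sketch): collects events from every agent in a MAS run."""
    events: list = field(default_factory=list)

    def log(self, event: AgentEvent) -> None:
        self.events.append(event)

    def run_features(self, run_id: str) -> dict:
        """Aggregate one run into per-agent token totals."""
        totals = {}
        for e in self.events:
            if e.run_id == run_id:
                totals[e.agent] = totals.get(e.agent, 0) + e.tokens
        return totals


class AnomalyDetector:
    """Layer 2 (sketch): z-score of a run's total tokens against benign runs.

    A stand-in technique; the abstract does not specify the detector.
    """

    def __init__(self, benign_totals, threshold=3.0):
        self.mu = mean(benign_totals)
        self.sigma = stdev(benign_totals)
        self.threshold = threshold

    def score(self, features: dict) -> float:
        return abs(sum(features.values()) - self.mu) / self.sigma

    def is_anomalous(self, features: dict) -> bool:
        return self.score(features) > self.threshold


def explain(features: dict, benign_per_agent: dict) -> str:
    """Layer 3 (sketch): name the agent deviating most from its benign mean."""
    deviations = {
        agent: abs(total - benign_per_agent.get(agent, 0.0))
        for agent, total in features.items()
    }
    return max(deviations, key=deviations.get)


# Usage with made-up numbers: the "researcher" agent consumes far more tokens
# than in benign runs, so the whole run is flagged and that agent is blamed.
monitor = MonitoringLayer()
monitor.log(AgentEvent("run-1", "planner", 0.8, 120))
monitor.log(AgentEvent("run-1", "researcher", 2.5, 2400))
monitor.log(AgentEvent("run-1", "writer", 1.1, 310))

detector = AnomalyDetector(benign_totals=[700, 720, 690, 710, 705])
features = monitor.run_features("run-1")
benign_means = {"planner": 110, "researcher": 300, "writer": 295}
culprit = explain(features, benign_means) if detector.is_anomalous(features) else None
print(culprit)
```

The detector scores the run as a whole rather than each agent in isolation, which mirrors the abstract's point that failures can belong to the entire MAS; attribution to a single agent happens only afterwards, in the explanation step.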
Problem

Research questions and friction points this paper is trying to address.

Monitoring and interpreting failures in multi-agent systems with LLMs
Detecting anomalies across entire MAS workflows in real time
Explaining root causes of anomalies in multi-agent systems
Innovation

Methods, ideas, or system contributions that make the work stand out.

Real-time monitoring for multi-agent systems
Anomaly detection across MAS workflow
Root cause analysis for detected anomalies
Ron Solomon
Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Israel
Yarin Yerushalmi Levi
Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Israel
Lior Vaknin
Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Israel
Eran Aizikovich
Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Israel
Amit Baras
Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Israel
Etai Ohana
Department of Software and Information Systems Engineering, Ben-Gurion University of the Negev, Israel
Amit Giloni
Ben-Gurion University of the Negev
Shamik Bose
Fujitsu Research of Europe
Chiara Picardi
Fujitsu Research of Europe
Yuval Elovici
Head of Cyber@BGU, Director of Telekom Innovation Laboratories at BGU, Ben Gurion University
Computer and Network Security, Cyber Security
Asaf Shabtai
Software and Information Systems Engineering, Telekom Innovation Labs, Ben Gurion University
Computer and network security, machine learning