🤖 AI Summary
Problem: System-level failures in large language model (LLM)-powered multi-agent systems (MASs) are difficult to monitor and attribute, because most existing observability frameworks analyze each agent in isolation. Method: The paper proposes LumiMAS, an observability framework that treats the whole MAS workflow as the unit of observation. It has three layers: a monitoring and logging layer that records agent activity, an anomaly detection layer that flags anomalies across the workflow in real time, and an anomaly explanation layer that classifies detected anomalies and performs root cause analysis (RCA). Results: Evaluated on seven MAS applications built on two popular MAS platforms, including two applications designed to expose the effects of a hallucination or bias, LumiMAS is reported to be effective at failure detection, classification, and RCA.
📝 Abstract
The incorporation of large language models in multi-agent systems (MASs) has the potential to significantly improve our ability to autonomously solve complex problems. However, such systems introduce unique challenges in monitoring, interpreting, and detecting system failures. Most existing MAS observability frameworks focus on analyzing each individual agent separately, overlooking failures associated with the entire MAS. To bridge this gap, we propose LumiMAS, a novel MAS observability framework that incorporates advanced analytics and monitoring techniques. The proposed framework consists of three key components: a monitoring and logging layer, an anomaly detection layer, and an anomaly explanation layer. LumiMAS's first layer monitors MAS executions, creating detailed logs of the agents' activity. These logs serve as input to the anomaly detection layer, which detects anomalies across the MAS workflow in real time. Then, the anomaly explanation layer performs classification and root cause analysis (RCA) of the detected anomalies. LumiMAS was evaluated on seven different MAS applications, implemented using two popular MAS platforms, under a diverse set of possible failures. The applications include two novel failure-tailored applications that illustrate the effects of a hallucination or bias on the MAS. The evaluation results demonstrate LumiMAS's effectiveness in failure detection, classification, and RCA.
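The three-layer flow described in the abstract can be illustrated with a toy pipeline. This is a minimal sketch under stated assumptions, not the LumiMAS implementation: the event schema (`AgentEvent`), the single feature (tokens used per agent), the z-score detector, and the "largest deviation" explanation rule are all hypothetical stand-ins chosen for brevity. The abstract does not specify which features or models the framework uses.

```python
from dataclasses import dataclass, field
from statistics import mean, stdev


@dataclass
class AgentEvent:
    """One logged step of one agent (hypothetical schema)."""
    run_id: str
    agent: str
    tokens: int


@dataclass
class ExecutionLog:
    """Layer 1 (monitoring and logging): collects events for every MAS run."""
    events: list = field(default_factory=list)

    def record(self, event: AgentEvent) -> None:
        self.events.append(event)

    def run_features(self, run_id: str) -> dict:
        """Return per-agent token totals for one run."""
        totals = {}
        for e in self.events:
            if e.run_id == run_id:
                totals[e.agent] = totals.get(e.agent, 0) + e.tokens
        return totals


def detect_anomaly(baseline_totals, run_total, threshold=3.0):
    """Layer 2 (detection): z-score of whole-run token usage against a baseline."""
    mu, sigma = mean(baseline_totals), stdev(baseline_totals)
    z = (run_total - mu) / sigma if sigma > 0 else 0.0
    return abs(z) > threshold, z


def explain_anomaly(baseline_per_agent, run_per_agent):
    """Layer 3 (explanation): name the agent that deviates most from its baseline mean."""
    deviations = {
        agent: abs(run_per_agent.get(agent, 0) - mean(history))
        for agent, history in baseline_per_agent.items()
    }
    return max(deviations, key=deviations.get)


# Usage: five normal runs of a two-agent workflow, then one suspicious run.
log = ExecutionLog()
normal_runs = [(100, 200), (110, 190), (95, 210), (105, 205), (100, 195)]
for i, (planner_tokens, writer_tokens) in enumerate(normal_runs):
    log.record(AgentEvent(f"run{i}", "planner", planner_tokens))
    log.record(AgentEvent(f"run{i}", "writer", writer_tokens))
log.record(AgentEvent("bad", "planner", 100))
log.record(AgentEvent("bad", "writer", 900))

baseline = [log.run_features(f"run{i}") for i in range(len(normal_runs))]
baseline_totals = [sum(run.values()) for run in baseline]
baseline_per_agent = {a: [run[a] for run in baseline] for a in ("planner", "writer")}

bad_run = log.run_features("bad")
is_anomalous, z_score = detect_anomaly(baseline_totals, sum(bad_run.values()))
culprit = explain_anomaly(baseline_per_agent, bad_run) if is_anomalous else None
print(is_anomalous, culprit)
```

The point of the sketch is the division of labor: detection scores the run as a whole, so a failure that only shows up at the workflow level is still caught, and the per-agent breakdown is consulted only afterwards to localize the cause.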