🤖 AI Summary
To address the explainability bottleneck in multi-agent workflows—characterized by difficulties in observing, attributing, and repairing failures—this paper introduces the first integrated debugging framework combining interactive log visualization, human feedback loops, and LLM-as-a-judge automated error detection. It establishes a human-centered paradigm for multi-agent debugging. The framework enables users with diverse technical backgrounds to track execution traces in real time, collaboratively annotate anomalies, and leverage large language models for fine-grained error classification and root-cause attribution. A user study demonstrates that the tool significantly improves fault localization efficiency (reducing time by 57% on average) and attribution accuracy (+39%). Moreover, it supports iterative workflow configuration refinement grounded in human feedback. Empirical evaluation on real-world multi-agent workflows confirms both practical utility and generalizability.
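The LLM-as-a-judge component described above can be sketched as a per-step classifier over an execution trace. This is a minimal illustrative sketch, not XAgen's actual implementation: the error taxonomy, prompt format, `TraceStep` structure, and `call_llm` stub are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of LLM-as-a-judge error detection over a multi-agent
# execution trace. Every name here (taxonomy, prompt, call_llm) is an
# illustrative assumption, not XAgen's real API.
import json
from dataclasses import dataclass

# Assumed error taxonomy for fine-grained classification.
ERROR_TAXONOMY = ["tool_misuse", "hallucinated_fact", "instruction_drift", "none"]

@dataclass
class TraceStep:
    agent: str
    action: str
    output: str

def build_judge_prompt(step: TraceStep) -> str:
    """Render one trace step into a classification prompt for the judge LLM."""
    return (
        "Classify the error type of this agent step.\n"
        f"Allowed labels: {', '.join(ERROR_TAXONOMY)}\n"
        f"Agent: {step.agent}\nAction: {step.action}\nOutput: {step.output}\n"
        'Reply as JSON: {"label": ..., "rationale": ...}'
    )

def call_llm(prompt: str) -> str:
    # Stub standing in for a real model call; a deployment would query an LLM here.
    return json.dumps({"label": "none", "rationale": "stubbed"})

def judge_trace(trace: list[TraceStep]) -> list[dict]:
    """Judge each step separately so errors attribute to a specific agent/step."""
    verdicts = []
    for i, step in enumerate(trace):
        verdict = json.loads(call_llm(build_judge_prompt(step)))
        if verdict.get("label") not in ERROR_TAXONOMY:
            verdict["label"] = "none"  # guard against malformed judge output
        verdicts.append({"step": i, "agent": step.agent, **verdict})
    return verdicts

trace = [TraceStep("planner", "decompose_task", "steps: [search, summarize]")]
print(judge_trace(trace))
```

Judging step-by-step rather than over the whole log is what makes root-cause attribution possible: each verdict is tied to one agent and one step, so a failure can be localized instead of merely detected.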
📝 Abstract
As multi-agent systems powered by Large Language Models (LLMs) are increasingly adopted in real-world workflows, users with diverse technical backgrounds are now building and refining their own agentic processes. However, these systems can fail in opaque ways, making it difficult for users to observe, understand, and correct errors. We conducted formative interviews with 12 practitioners to identify mismatches between existing observability tools and users' needs. Based on these insights, we designed XAgen, an explainability tool that supports users with varying AI expertise through three core capabilities: log visualization for glanceable workflow understanding, human-in-the-loop feedback to capture expert judgment, and automatic error detection via an LLM-as-a-judge. In a user study with 8 participants, XAgen helped users more easily locate failures, attribute them to specific agents or steps, and iteratively improve configurations. Our findings surface human-centered design guidelines for explainable agentic AI development and highlight opportunities for more context-aware interactive debugging.