🤖 AI Summary
In large language model–driven multi-agent systems, fault attribution remains challenging and manual debugging is prohibitively costly. To address this, we propose FAMAS, the first spectrum-based fault attribution method tailored for multi-agent systems. FAMAS systematically replays execution trajectories and abstracts agent behaviors into fine-grained behavioral units that jointly encode agent roles, action semantics, and contextual information. It then introduces a novel suspiciousness metric that quantifies each behavioral unit’s contribution to task failure by integrating multi-round mutation-based execution with spectrum analysis. Evaluated on the Who-and-When benchmark, FAMAS significantly outperforms 12 baseline methods, achieving an average 32.7% improvement in fault localization accuracy. This enables effective automated debugging and facilitates robustness optimization of multi-agent systems.
📝 Abstract
Large Language Model-powered Multi-Agent Systems (MASs) are increasingly employed to automate complex real-world tasks, such as programming and scientific discovery. Despite their promise, MASs are not without flaws. Failure attribution in MASs, that is, pinpointing the specific agent actions responsible for failures, remains underexplored and labor-intensive, posing significant challenges for debugging and system improvement. To bridge this gap, we propose FAMAS, the first spectrum-based failure attribution approach for MASs, which operates through systematic trajectory replay and abstraction, followed by spectrum analysis. The core idea of FAMAS is to estimate, from variations across repeated MAS executions, the likelihood that each agent action is responsible for the failure. In particular, we propose a novel suspiciousness formula tailored to MASs, which integrates two key factor groups, namely the agent behavior group and the action behavior group, to account for the agent activation patterns and the action activation patterns within the execution trajectories of MASs. Through extensive evaluations against 12 baselines on the Who and When benchmark, FAMAS demonstrates superior performance, outperforming all compared methods.
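To make the spectrum-analysis idea concrete, the following is a minimal sketch of how suspiciousness could be estimated from repeated executions. It is not the FAMAS formula, which the abstract does not spell out: it substitutes the classic Ochiai metric from spectrum-based fault localization and ignores the agent behavior and action behavior factor groups. The function name `ochiai_scores`, the `(agent, action)` tuple encoding, and the example runs are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def ochiai_scores(runs):
    """Score each action with the Ochiai metric (a stand-in, not FAMAS's formula).

    runs: list of (actions, failed) pairs, where `actions` is a set of hashable
    action identifiers seen in one replayed trajectory and `failed` is a bool.
    """
    total_failed = sum(1 for _, failed in runs if failed)
    in_failed = Counter()  # number of failed runs that contain the action
    in_passed = Counter()  # number of successful runs that contain the action
    for actions, failed in runs:
        target = in_failed if failed else in_passed
        for action in set(actions):
            target[action] += 1

    scores = {}
    for action in set(in_failed) | set(in_passed):
        ef, ep = in_failed[action], in_passed[action]
        denom = sqrt(total_failed * (ef + ep))
        scores[action] = ef / denom if denom else 0.0
    return scores


# Hypothetical replayed trajectories, abstracted to (agent, action) pairs.
runs = [
    ({("planner", "decompose"), ("coder", "write_patch")}, True),
    ({("planner", "decompose"), ("coder", "write_patch"), ("tester", "run_tests")}, True),
    ({("planner", "decompose"), ("tester", "run_tests")}, False),
    ({("planner", "decompose"), ("coder", "reuse_library")}, False),
]

scores = ochiai_scores(runs)
ranking = sorted(scores, key=scores.get, reverse=True)
for action in ranking:
    print(action, round(scores[action], 3))
```

In this toy data the `("coder", "write_patch")` action appears only in failed runs and therefore ranks first, while `("planner", "decompose")` appears in every run and receives a lower score. A MAS-specific formula such as the one the abstract describes would additionally weigh agent and action activation patterns.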