Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems

📅 2026-04-24
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses failure attribution in large language model (LLM)-based multi-agent systems, where natural-language reasoning, nondeterministic outputs, and complex agent interactions make debugging difficult. The authors introduce TraceElephant, a benchmark that provides full execution traces and reproducible environments, so that attribution can be studied under the fully observable conditions developers actually face when debugging. In a systematic evaluation of attribution techniques across configurations, full execution traces improve attribution accuracy by up to 76% over a partial-observation counterpart that captures only agent outputs. The results indicate that missing inputs and context obscure many failure causes.

📝 Abstract
Failure attribution, i.e., identifying the responsible agent and decisive step of a failure, is particularly challenging in LLM-based multi-agent systems (MAS) due to their natural-language reasoning, nondeterministic outputs, and intricate interaction dynamics. A reliable benchmark is therefore essential to guide and evaluate attribution techniques. Yet existing benchmarks rely on partially observable traces that capture only agent outputs, omitting the inputs and context that developers actually use when debugging. We argue that failure attribution should be studied under full execution observability, aligning with real-world developer-facing scenarios where complete traces, rather than only outputs, are accessible for diagnosis. To this end, we introduce TraceElephant, a benchmark designed for failure attribution with full execution traces and reproducible environments. We then systematically evaluate failure attribution techniques across various configurations. Specifically, full traces improve attribution accuracy by up to 76% over a partial-observation counterpart, confirming that missing inputs obscure many failure causes. TraceElephant provides a foundation for follow-up failure attribution research, promoting evaluation practices that reflect real-world debugging and supporting the development of more transparent MASs.
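As a rough illustration of the task the abstract describes, the sketch below models a trace step that may or may not carry its inputs, and scores predicted (agent, step) attributions against ground-truth labels. All names here (`Step`, `Attribution`, `to_partial`, `attribution_accuracy`) are hypothetical and are not taken from TraceElephant; the two metrics shown, agent-level and step-level accuracy, are an assumption about how attribution might be scored, not the benchmark's confirmed protocol.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass(frozen=True)
class Step:
    """One step of a multi-agent run (hypothetical schema)."""
    index: int
    agent: str
    output: str
    inputs: Optional[str] = None  # None models an output-only (partial) trace


@dataclass(frozen=True)
class Attribution:
    """A claimed failure cause: the responsible agent and the decisive step."""
    agent: str
    step: int


def to_partial(trace: Sequence[Step]) -> List[Step]:
    """Drop inputs from every step, simulating a partially observable trace."""
    return [Step(s.index, s.agent, s.output, None) for s in trace]


def attribution_accuracy(
    predictions: Sequence[Attribution], labels: Sequence[Attribution]
) -> Dict[str, float]:
    """Fraction of failures whose agent, and whose step index, were identified.

    Assumption: a step-level hit requires an exact step-index match.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return {"agent": 0.0, "step": 0.0}
    n = len(labels)
    agent_hits = sum(p.agent == g.agent for p, g in zip(predictions, labels))
    step_hits = sum(p.step == g.step for p, g in zip(predictions, labels))
    return {"agent": agent_hits / n, "step": step_hits / n}
```

Running the same attribution technique once on full traces and once on the output of `to_partial`, then comparing the two score dictionaries, would mirror the full versus partial observation comparison in spirit.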
Problem

Research questions and friction points this paper is trying to address.

failure attribution
LLM-based multi-agent systems
execution traces
debugging
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

failure attribution
multi-agent systems
full execution trace
benchmark
LLM-based reasoning
Mengzhuo Chen
State Key Laboratory of Complex System Modeling and Simulation Technology; Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China
Junjie Wang
Institute of Software, Chinese Academy of Sciences
Software Engineering
Fangwen Mu
State Key Laboratory of Complex System Modeling and Simulation Technology; Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China
Yawen Wang
The University of Texas at Arlington
Gear Dynamics, Noise and Vibration
Zhe Liu
State Key Laboratory of Complex System Modeling and Simulation Technology; Institute of Software, Chinese Academy of Sciences
Huanxiang Feng
State Key Laboratory of Complex System Modeling and Simulation Technology; Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences, Beijing, China
Qing Wang
Institute of Software, Chinese Academy of Sciences
Software Engineering