Where LLM Agents Fail and How They Can Learn From Failures

📅 2025-09-29
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Large language model (LLM)-based agents frequently fail in multi-step tasks because a single root-cause error propagates through subsequent decisions, and existing systems lack systematic error attribution capabilities. Method: This paper introduces AgentDebug, a modular framework for agent failure analysis. It comprises: (1) AgentErrorTaxonomy, a fine-grained classification of failure modes spanning memory, reflection, planning, action, and system-level operations; (2) AgentErrorBench, the first human-annotated dataset of real-world error trajectories; and (3) an iterative debugging framework enabling root-cause localization and recovery. Results: Experiments on ALFWorld, GAIA, and WebShop demonstrate that AgentDebug improves all-correct accuracy by 24% and step-level accuracy by 17% over the strongest baseline, and its targeted feedback yields up to 26% relative improvement in task success. The framework significantly enhances agent debuggability and supports continuous improvement.

๐Ÿ“ Abstract
Large Language Model (LLM) agents, which integrate planning, memory, reflection, and tool-use modules, have shown promise in solving complex, multi-step tasks. Yet their sophisticated architectures amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions, leading to task failure. Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way, and therefore fail to detect these errors accordingly. We address this gap with three contributions. First, we introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations. Second, we construct AgentErrorBench, the first dataset of systematically annotated failure trajectories from ALFWorld, GAIA, and WebShop, grounding error analysis in real-world agent rollouts. Third, we propose AgentDebug, a debugging framework that isolates root-cause failures and provides corrective feedback, enabling agents to recover and iteratively improve. Experiments on AgentErrorBench show that AgentDebug achieves 24% higher all-correct accuracy and 17% higher step accuracy compared to the strongest baseline. Beyond detection, the targeted feedback generated by AgentDebug enables LLM agents to iteratively recover from failures, yielding up to 26% relative improvements in task success across ALFWorld, GAIA, and WebShop. These results establish principled debugging as a pathway to more reliable and adaptive LLM agents. The code and data will be available at https://github.com/ulab-uiuc/AgentDebug
Problem

Research questions and friction points this paper is trying to address.

LLM agents lack modular error detection for cascading failures
Current systems cannot comprehensively analyze agent error propagation
Missing framework for isolating root-cause failures in multi-step tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces AgentErrorTaxonomy for failure classification
Develops AgentErrorBench dataset with annotated failures
Proposes AgentDebug framework for root-cause isolation
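The recovery mechanism described above (localize the root cause, generate corrective feedback, let the agent retry) can be sketched as a small loop. Everything here is illustrative: `ToyAgent` is a stand-in that succeeds once it has received a hint, not the paper's agent or its feedback format.

```python
# Hypothetical sketch of AgentDebug-style iterative recovery:
# run the agent, localize the faulty step, inject targeted
# feedback, and retry up to a fixed number of rounds.

class ToyAgent:
    def __init__(self):
        self.hint = None

    def run(self, task):
        # Stand-in behavior: succeeds only after corrective feedback.
        if self.hint:
            return ["ok"], True
        return ["plan: wrong subgoal"], False

    def apply_feedback(self, hint):
        self.hint = hint

def debug_and_retry(agent, task, max_rounds=3):
    for _ in range(max_rounds):
        trajectory, success = agent.run(task)
        if success:
            return True
        # Root-cause localization, simplified here to the earliest step.
        faulty = trajectory[0]
        agent.apply_feedback(f"Revise step: {faulty}")
    return False

print(debug_and_retry(ToyAgent(), "find the mug"))  # -> True
```

The bounded retry loop is what turns one-shot failure detection into the iterative improvement the summary reports (up to 26% relative gain in task success).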
🔎 Similar Papers
No similar papers found.