When Agents Fail: A Comprehensive Study of Bugs in LLM Agents with Automated Labeling

📅 2026-01-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the critical yet underexplored challenge of debugging defects in large language model (LLM) agents, which are hard to diagnose because their types, root causes, and impacts have not been systematically characterized. To bridge this gap, we present the first comprehensive taxonomy and analysis of LLM agent defects, based on 1,187 real-world instances collected from Stack Overflow, GitHub, and Hugging Face. We investigate defect categories, root causes, effects, and the components in which they occur. To enable scalable analysis, we propose BugReAct, a novel automated defect annotation method built on the ReAct framework. It uses the Gemini 2.5 Flash model together with external tools to label defects with high accuracy at low cost (averaging $0.01 per instance), demonstrating that automated identification and annotation of LLM agent defects is feasible.

📝 Abstract

Large Language Models (LLMs) have revolutionized intelligent application development. While standalone LLMs cannot perform any actions, LLM agents address this limitation by integrating tools. However, debugging LLM agents is difficult and costly, as the field is still in its early stage and the community is underdeveloped. To understand the bugs encountered during agent development, we present the first comprehensive study of bug types, root causes, and effects in LLM agent-based software. We collected and analyzed 1,187 bug-related posts and code snippets from Stack Overflow, GitHub, and Hugging Face forums, focusing on LLM agents built with seven widely used LLM frameworks as well as custom implementations. For a deeper analysis, we also studied the component in which each bug occurred, along with the programming language and framework. This study also investigates the feasibility of automating bug identification. To that end, we built a ReAct agent named BugReAct, equipped with external tools, to determine whether it can detect and annotate the bugs in our dataset. We found that BugReAct with Gemini 2.5 Flash achieved remarkable performance in annotating bug characteristics, at an average cost of 0.01 USD per post or code snippet.
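To make the ReAct-style annotation idea concrete, here is a minimal Python sketch of a reason-act-observe loop that ends in a structured label. It is not the authors' BugReAct implementation: the `Annotation` fields only mirror the dimensions the abstract names, while `scripted_model`, `lookup_error`, the label strings, and the step budget are all hypothetical stand-ins, with the model call replaced by a fixed script so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Annotation:
    """Label fields mirroring the dimensions named in the abstract."""
    bug_type: str
    root_cause: str
    effect: str
    component: str


# A step is either ("action", "tool_name:argument") or ("final", "a|b|c|d").
Step = Tuple[str, str]


def scripted_model(post: str, observations: List[str]) -> Step:
    """Hypothetical stand-in for an LLM call; a real agent would prompt a model."""
    if not observations:
        # First turn: ask a tool for more evidence before answering.
        return ("action", "lookup_error:" + post)
    # Placeholder labels, not taken from the paper's taxonomy.
    return ("final", "API misuse|outdated library call|crash|tool integration")


def lookup_error(text: str) -> str:
    """Hypothetical tool; a real one might search documentation or issues."""
    return "ImportError mentioned" if "ImportError" in text else "no known error string"


TOOLS: Dict[str, Callable[[str], str]] = {"lookup_error": lookup_error}


def annotate(
    post: str,
    model: Callable[[str, List[str]], Step] = scripted_model,
    tools: Dict[str, Callable[[str], str]] = TOOLS,
    max_steps: int = 5,
) -> Annotation:
    """Alternate model decisions and tool calls until a final label appears."""
    observations: List[str] = []
    for _ in range(max_steps):
        kind, payload = model(post, observations)
        if kind == "final":
            return Annotation(*payload.split("|"))
        # Split only on the first colon so the argument may contain colons.
        tool_name, _, argument = payload.partition(":")
        observations.append(tools[tool_name](argument))
    raise RuntimeError("no final answer within the step budget")
```

Calling `annotate("ImportError: cannot import name 'AgentExecutor'")` runs one tool call and then returns an `Annotation`. The step budget is a design choice in this sketch: it bounds the number of model calls per post, which is one way a per-item cost could be kept low.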

Problem

Research questions and friction points this paper is trying to address.

LLM agents

bugs

debugging

error analysis

intelligent agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM agents

bug analysis

automated labeling