🤖 AI Summary
This work addresses the frequent failure of large language model (LLM) agents on complex tasks due to runtime errors, a problem compounded by the lack of systematic repair mechanisms and standardized evaluation benchmarks. The authors present the first empirical study of real-world bug cases drawn from Stack Overflow, GitHub, and HuggingFace Forums, distill common fix patterns, and introduce AgentDefect, the first bug benchmark dataset for LLM agents, which contains 37 runtime bug instances with fixed code and test files. They propose SelfHeal, a multi-agent automated repair framework in which two ReAct agents (a fix agent and a critic agent) draw on an internal repository of fix rules and on external web search to propose and validate fixes. In the reported evaluation on AgentDefect, SelfHeal with Gemini 3 Pro as the backbone LLM outperforms both baseline and state-of-the-art approaches by a significant margin.
📝 Abstract
Large Language Models (LLMs) have transformed software development and AI applications. While LLMs are designed for text processing, LLM agents extend this capability by enabling autonomous actions, tool use, and multi-step task completion. As this field grows, developers face new challenges in debugging these complex systems. To address this challenge, we present the first empirical study on bug fix patterns in LLM agents. We study buggy posts and code snippets from three platforms: Stack Overflow, GitHub, and HuggingFace Forums. We examine their fix patterns, the components where fixes are applied, and the programming languages and frameworks involved. Furthermore, we introduce AgentDefect, the first benchmark dataset for bugs in LLM agents. The dataset contains 37 runtime buggy instances along with fixed code and test files. Finally, we present SelfHeal, a multi-agent system designed to fix bugs in LLM agents. The system leverages two independent ReAct agents: the fix agent and the critic agent. These agents use tools that provide both internal knowledge (fix rules) and external knowledge (web search) to propose and validate fixes. Our evaluation shows that SelfHeal with Gemini 3 Pro as the backbone LLM outperforms both baseline and state-of-the-art approaches by a significant margin.