🤖 AI Summary
This paper addresses the problem of execution failures in agent-based invocation of external tools, particularly kubectl, caused by implicit semantic errors rather than syntactic ones. To tackle this, we propose a post-execution, feedback-driven error repair framework that combines large language models’ reflective reasoning with retrieval-augmented generation (RAG) grounded in tool documentation and domain-specific troubleshooting knowledge. Unlike conventional syntax-oriented validation, our method retrieves relevant domain knowledge based on the tool's actual response and generates semantically grounded corrective actions. Experimental results show that our RAG-based reflection improves kubectl pass rate for 55% of the evaluated models and makes answers to user requests 36% more likely to be correct on average, while the curated troubleshooting knowledge base improves pass rate over official Kubernetes documentation by an average of 10%.
📝 Abstract
Agentic systems interact with external systems by calling tools such as Python functions, REST API endpoints, or command-line tools like kubectl for Kubernetes. These tool calls often fail for various syntactic and semantic reasons, and some less obvious semantic errors can only be identified and resolved by analyzing the tool's response. To repair these errors, we develop a post-tool execution reflection component that combines large language model (LLM)-based reflection with domain-specific retrieval-augmented generation (RAG) over documents describing the tool being called and troubleshooting documents related to it. In this paper, we focus on the kubectl command-line tool for managing Kubernetes, a platform for orchestrating containerized applications across clusters. Through a larger empirical study and a smaller manual evaluation, we find that our RAG-based reflection repairs kubectl commands so that they are more likely to execute successfully (pass rate) for 55% of the models we evaluated and are, on average, 36% more likely to correctly answer the user query. We also find that troubleshooting documents improve pass rate over official documentation by an average of 10%.
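To make the control flow concrete, below is a minimal sketch of the reflection loop the abstract describes: execute the command, and on failure retrieve documentation keyed on the tool's actual response, then ask an LLM for a repaired command and retry. The `retrieve_docs` and `llm_repair` functions are hypothetical stubs; the paper does not specify these interfaces, so only the overall execute, reflect, retrieve, repair, retry flow follows the text.

```python
import shlex
import subprocess

# Placeholder components: the abstract does not specify the retriever or
# LLM interfaces, so these are hypothetical stubs to be replaced.
def retrieve_docs(query: str) -> str:
    """Fetch relevant tool/troubleshooting passages (assumed RAG backend)."""
    return ""  # stub: return retrieved documentation text

def llm_repair(command: str, error: str, docs: str) -> str:
    """Ask an LLM to propose a corrected command from the error and docs."""
    return command  # stub: return the LLM's repaired command

def run_with_reflection(command: list[str], max_attempts: int = 3) -> str:
    """Execute a kubectl command; on failure, reflect on the tool's
    response, retrieve documentation, and retry a repaired command."""
    for _ in range(max_attempts):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        # Post-tool execution reflection: ground the repair in the tool's
        # actual error output plus retrieved documentation.
        cmd_str = " ".join(command)
        docs = retrieve_docs(cmd_str + "\n" + result.stderr)
        command = shlex.split(llm_repair(cmd_str, result.stderr, docs))
    raise RuntimeError(f"could not repair command after {max_attempts} attempts")

# Example: a semantically wrong command (nonexistent namespace) that a
# syntax check would accept but that only the tool's response reveals.
# run_with_reflection(["kubectl", "get", "pods", "-n", "no-such-namespace"])
```

Keying retrieval on the tool's error output rather than on the original user query is what lets such a loop catch the semantic failures that pass syntactic validation.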