ToolFuzz -- Automated Agent Tool Testing

📅 2025-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
LLM agents interact with environments via natural-language tool documentation, yet such documentation frequently suffers from over-specification, under-specification, or factual inaccuracies, leading to runtime and response errors; conventional software testing methods are ill-suited to detecting semantic defects expressed in natural language. This paper introduces the first automated testing framework designed specifically for tool documentation, pioneering a semantics-aware fuzz-testing methodology. The approach integrates natural-language-driven test-case synthesis, input-guided fuzz generation, and lightweight prompt augmentation to identify both runtime failures and logical inconsistencies. Evaluated on 67 real-world tools (32 from LangChain and 35 custom-built), the method detects 20× more erroneous inputs than prompt-engineering baselines. Crucially, it provides the first systematic empirical evidence of widespread under-specification in tool documentation.

📝 Abstract
Large Language Model (LLM) Agents leverage the advanced reasoning capabilities of LLMs in real-world applications. To interface with an environment, these agents often rely on tools, such as web search or database APIs. As the agent provides the LLM with tool documentation alongside the user query, the completeness and correctness of this documentation are critical. However, tool documentation is often over-, under-, or ill-specified, impeding the agent's accuracy. Standard software testing approaches struggle to identify these errors because they are expressed in natural language. Thus, despite its importance, there currently exists no automated method for testing the tool documentation of agents. To address this issue, we present ToolFuzz, the first method for automated testing of tool documentation. ToolFuzz is designed to discover two types of errors: (1) user queries leading to tool runtime errors and (2) user queries that lead to incorrect agent responses. ToolFuzz can generate a large and diverse set of natural inputs, effectively finding tool description errors at a low false-positive rate. Further, we present two straightforward prompt-engineering approaches. We evaluate all three tool-testing approaches on 32 common LangChain tools, 35 newly created custom tools, and 2 novel benchmarks to further strengthen the assessment. We find that many publicly available tools suffer from underspecification. Specifically, we show that ToolFuzz identifies 20x more erroneous inputs compared to the prompt-engineering approaches, making it a key component for building reliable AI agents.
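The two error classes the abstract describes can be illustrated with a minimal fuzzing loop: generate diverse inputs from a tool's documentation, run the tool, and record (1) runtime crashes and (2) responses a correctness oracle rejects. This is a hypothetical sketch, not the paper's actual implementation; in ToolFuzz the query generation and oracle would be LLM-driven, whereas here a fixed pool of edge-case strings and a trivial oracle stand in for them.

```python
# Hedged sketch of documentation-guided tool fuzzing. All function and
# variable names are illustrative, not ToolFuzz's real API.

def generate_queries(tool_doc: str, n: int = 5):
    # ToolFuzz would synthesize diverse natural-language queries from the
    # documentation with an LLM; a static pool of edge cases stands in here.
    pool = ["", "0", "-1", "3.5", "not a number"]
    return pool[:n]

def fuzz_tool(tool_fn, tool_doc, oracle):
    """Run the tool on generated queries and collect both error types."""
    runtime_errors, response_errors = [], []
    for query in generate_queries(tool_doc):
        try:
            output = tool_fn(query)
        except Exception as exc:
            runtime_errors.append((query, repr(exc)))   # error type (1)
            continue
        if not oracle(query, output):
            response_errors.append((query, output))     # error type (2)
    return runtime_errors, response_errors

# Toy tool whose (under-specified) documentation claims it squares
# "any number" but in fact crashes on non-integer text.
def square_tool(query: str) -> int:
    return int(query) ** 2

errs, wrong = fuzz_tool(
    square_tool,
    tool_doc="Squares any number given as text.",
    oracle=lambda query, output: output >= 0,
)
print(len(errs), len(wrong))  # → 3 0
```

The empty string, "3.5", and "not a number" all trigger `ValueError` in `int()`, exposing the gap between the documented contract ("any number") and the tool's actual behavior, which is exactly the kind of under-specification the paper reports as widespread.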
Problem

Research questions and friction points this paper is trying to address.

Automated testing of tool documentation for LLM agents
Identifying errors in tool documentation causing runtime issues
Generating diverse inputs to detect incorrect agent responses
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated testing for tool documentation errors
Generates diverse natural language inputs
Identifies runtime and response errors effectively
👥 Authors
Ivan Milev, Department of Computer Science, ETH Zurich
Mislav Balunović, ETH Zurich (Machine Learning)
Maximilian Baader, ETH Zurich (Machine Learning)
Martin T. Vechev, Department of Computer Science, ETH Zurich