🤖 AI Summary
Existing LLM agents frequently fail when invoking enterprise REST APIs to execute complex workflows, primarily due to ambiguous API documentation, intricate input schemas, and nonstandardized response formats; current tool-use benchmarks inadequately reflect such real-world challenges. To address this, we propose the first REST API tool-readiness evaluation framework specifically designed for LLM agents. Our approach introduces a novel three-category error taxonomy—input misinterpretation, inconsistent output handling, and schema mismatch—and integrates API schema analysis, automated test case generation, natural language instruction synthesis, and tool definition enhancement. Evaluated on 750 systematically constructed test cases, our framework identifies prevalent failure modes, enabling rapid API diagnostics and targeted toolification. Experimental results demonstrate substantial improvements in invocation success rate and robustness across diverse enterprise APIs.
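The three-category taxonomy lends itself to a simple programmatic check. Below is a minimal, hypothetical sketch of how an evaluator might bucket a failed invocation into the three categories; the function name, heuristics, and inputs are our illustrative assumptions, not the paper's implementation:

```python
from enum import Enum

class ToolError(Enum):
    """Hypothetical encoding of the paper's three-category error taxonomy."""
    INPUT_MISINTERPRETATION = "input_misinterpretation"
    OUTPUT_HANDLING = "inconsistent_output_handling"
    SCHEMA_MISMATCH = "schema_mismatch"

def classify_failure(expected_call: dict, actual_call: dict,
                     response_handled_ok: bool) -> ToolError | None:
    """Illustrative heuristic: compare the agent's tool call against the
    expected invocation derived from a generated test case."""
    if set(actual_call) != set(expected_call):
        # Wrong or missing parameter names: the agent's call does not
        # match the declared input schema.
        return ToolError.SCHEMA_MISMATCH
    if any(actual_call[k] != expected_call[k] for k in expected_call):
        # Right fields, wrong values: the instruction was misread.
        return ToolError.INPUT_MISINTERPRETATION
    if not response_handled_ok:
        # The call itself was correct, but the response was mishandled.
        return ToolError.OUTPUT_HANDLING
    return None  # successful invocation
```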
📝 Abstract
Large Language Models (LLMs) are enabling autonomous agents to perform complex workflows using external tools or functions, often provided via REST APIs in enterprise systems. However, directly utilizing these APIs as tools poses challenges due to their complex input schemas, elaborate responses, and often ambiguous documentation. Current benchmarks for tool testing do not adequately address these complexities, leaving a critical gap in evaluating API readiness for agent-driven automation. In this work, we present a novel testing framework aimed at evaluating and enhancing the readiness of REST APIs to function as tools for LLM-based agents. Our framework transforms APIs into tools, generates comprehensive test cases for the APIs, translates test cases into natural language instructions suitable for agents, enriches tool definitions, and evaluates the agent's ability to correctly invoke the API and process its inputs and responses. To provide actionable insights, we analyze the outcomes of 750 test cases, presenting a detailed taxonomy of errors, including input misinterpretation, output handling inconsistencies, and schema mismatches. Additionally, we classify these test cases to streamline debugging and refinement of tool integrations. This work offers a foundational step toward enabling enterprise APIs as tools, improving their usability in agent-based applications.
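To make the pipeline concrete, here is a minimal sketch of two of its stages: deriving a test case from an API schema and synthesizing a natural-language instruction for the agent. All names (the helper functions, the sample `getOrderStatus` operation) are assumptions made for illustration; the paper's generator is more systematic than this toy version:

```python
def generate_test_case(openapi_op: dict) -> dict:
    """Toy test-case generator: build one positive test from an OpenAPI
    operation by filling required parameters with their example values."""
    params = {
        p["name"]: p.get("example", "PLACEHOLDER")
        for p in openapi_op.get("parameters", [])
        if p.get("required")
    }
    return {"operation_id": openapi_op["operationId"], "arguments": params}

def to_instruction(test_case: dict) -> str:
    """Render a generated test case as a natural-language instruction."""
    args = ", ".join(f"{k}={v!r}" for k, v in test_case["arguments"].items())
    return f"Call {test_case['operation_id']} with {args} and report the result."

# Hypothetical OpenAPI operation for demonstration purposes.
op = {
    "operationId": "getOrderStatus",
    "parameters": [{"name": "order_id", "required": True, "example": "A-1042"}],
}
case = generate_test_case(op)
print(to_instruction(case))
# -> Call getOrderStatus with order_id='A-1042' and report the result.
```

The resulting instruction, the expected invocation, and the agent's actual tool call would then feed an evaluation step such as the taxonomy-based classifier sketched above.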