Evaluating LLMs on Sequential API Call Through Automated Test Generation

📅 2025-07-12
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing LLM-based tool-use evaluation suffers from three key limitations: (1) reliance on manually crafted test cases, (2) absence of automated semantic correctness verification, and (3) neglect of dynamic state interactions across sequential API calls. To address these, we propose StateGen, the first state-machine-driven framework integrating constraint solving and control-flow injection, enabling dual-LLM collaboration to jointly generate executable code and verifiable natural-language task specifications. Based on StateGen, we construct StateEval, a benchmark comprising 120 carefully designed test cases that cover state evolution, conditional branching, and cross-step dependencies. Experimental results demonstrate that our approach substantially enhances test realism and difficulty, systematically exposing critical deficiencies in current LLMs' semantic reasoning and state persistence capabilities when executing complex, multi-step API sequences.

๐Ÿ“ Abstract
By integrating tools from external APIs, Large Language Models (LLMs) have expanded their promising capabilities in a diverse spectrum of complex real-world tasks. However, testing, evaluation, and analysis of LLM tool use remain in their early stages. Most existing benchmarks rely on manually collected test cases, many of which cannot be automatically checked for semantic correctness and instead depend on static methods such as string matching. Additionally, these benchmarks often overlook the complex interactions that occur between sequential API calls, which are common in real-world applications. To fill this gap, we introduce StateGen, an automated framework designed to generate diverse coding tasks involving sequential API interactions. StateGen combines state-machine-based API constraint solving and validation, energy-based sampling, and control-flow injection to generate executable programs. These programs are then translated into human-like natural language task descriptions through a collaboration of two LLM agents. Utilizing StateGen, we construct StateEval, a benchmark encompassing 120 verified test cases spanning three representative scenarios: Session Service, Tensor Operation, and ElevenLabs MCP. Experimental results confirm that StateGen can effectively generate challenging and realistic API-oriented tasks, highlighting areas for improvement in how current LLMs incorporate APIs.
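The state-machine-based constraint checking described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation; the API names (`create_session`, `send_message`, `close_session`) and the transition table are hypothetical stand-ins loosely modeled on the Session Service scenario:

```python
# Sketch: validate a sequential API-call trace against a finite-state machine.
# Transitions map (current_state, api_call) -> next_state; any call with no
# entry is illegal in that state, mirroring the idea of state-dependent
# API constraints across sequential calls.

TRANSITIONS = {
    ("INIT", "create_session"): "OPEN",
    ("OPEN", "send_message"): "OPEN",     # messages may repeat while open
    ("OPEN", "close_session"): "CLOSED",
}

def is_valid_sequence(calls):
    """Return True if every call is legal and the trace ends in CLOSED."""
    state = "INIT"
    for call in calls:
        nxt = TRANSITIONS.get((state, call))
        if nxt is None:            # call violates the state constraint
            return False
        state = nxt
    return state == "CLOSED"

print(is_valid_sequence(["create_session", "send_message", "close_session"]))  # True
print(is_valid_sequence(["send_message"]))                                     # False
```

A benchmark generator built this way can enumerate only traces the machine accepts, so every generated task is executable by construction.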
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs on sequential API call interactions
Automating test generation for semantic correctness validation
Addressing gaps in benchmarks for complex API task scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

StateGen automates diverse sequential API task generation
Combines constraint solving, sampling, and control-flow injection
Translates programs to natural language via LLM collaboration
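The energy-based sampling mentioned above can be sketched as Boltzmann-weighted selection of the next API call among the state machine's legal candidates. The energy values here are invented for illustration and do not come from the paper:

```python
import math
import random

# Hypothetical per-call energies: lower energy makes a call more likely,
# which lets the generator bias sampling toward harder or rarer sequences.
ENERGY = {"send_message": 0.5, "close_session": 1.5}

def sample_next(rng, candidates):
    """Pick one candidate call with probability proportional to exp(-E)."""
    weights = [math.exp(-ENERGY[c]) for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed for reproducible generation
calls = [sample_next(rng, ["send_message", "close_session"]) for _ in range(5)]
print(calls)
```

Tuning the energy table is one plausible way to steer generated tasks toward longer or less common call patterns while staying within the machine's constraints.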
🔎 Similar Papers
No similar papers found.
Yuheng Huang
Cedars-Sinai Medical Center
CMR
Da Song
CIFAR AI Safety Postdoctoral Fellow
Software Engineering, Large Language Model, Quality Assurance, HCI
Zhenlan Ji
The Hong Kong University of Science and Technology
Software Engineering
Shuai Wang
Hong Kong University of Science and Technology, Hong Kong, China
Lei Ma
The University of Tokyo, Tokyo, Japan; University of Alberta, Edmonton, AB, Canada