RESTestBench: A Benchmark for Evaluating the Effectiveness of LLM-Generated REST API Test Cases from NL Requirements

📅 2026-04-28

🤖 AI Summary

This work addresses the challenge of evaluating the effectiveness of REST API test cases generated from natural language requirements, a task inadequately supported by traditional metrics such as code coverage. The authors propose the first benchmark specifically designed for assessing REST API testing against natural language specifications, comprising three services with both precise and ambiguous requirement variants. They introduce a requirement-based mutation testing metric to quantify test effectiveness and integrate large language models, natural language processing, and property-based testing within a refinement mechanism that interacts with the system under test. Experimental results demonstrate that highly detailed requirements enable the generation of effective tests without reliance on implementation artifacts; however, when requirements are ambiguous, interaction with faulty or mutated systems substantially degrades test effectiveness, revealing the critical influence of system feedback on test generation quality.

📝 Abstract

Existing REST API testing tools are typically evaluated using code coverage and crash-based fault metrics. However, recent LLM-based approaches increasingly generate tests from NL requirements to validate functional behaviour, making traditional metrics weak proxies for whether generated tests validate intended behaviour. To address this gap, we present RESTestBench, a benchmark comprising three REST services paired with manually verified NL requirements in both precise and vague variants, enabling controlled and reproducible evaluation of requirement-based test generation. RESTestBench further introduces a requirements-based mutation testing metric that measures the fault-detection effectiveness of a generated test case with respect to a specific requirement, extending the property-based approach of Bartocci et al. Using RESTestBench, we evaluate two approaches across multiple state-of-the-art LLMs: (i) non-refinement-based generation, and (ii) refinement-based generation guided by interaction with the running SUT. In the refinement experiments, RESTestBench assesses how exposure to the actual implementation, valid or mutated, affects test effectiveness. Our results show that test effectiveness drops considerably when the generator interacts with faulty or mutated code, especially for vague requirements, sometimes negating the benefit of refinement and indicating that incorporating actual SUT behaviour is unnecessary when requirement detail is high.

Problem

Research questions and friction points this paper is trying to address.

REST API testing

LLM-generated test cases

natural language requirements

test effectiveness evaluation

requirement-based testing

Innovation

Methods, ideas, or system contributions that make the work stand out.

REST API testing

LLM-generated test cases

natural language requirements