TESTEVAL: Benchmarking Large Language Models for Test Case Generation

📅 2024-06-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit significant limitations in directed software test generation—particularly for fine-grained structural coverage (e.g., line, branch, and path coverage)—yet no standardized benchmark exists to rigorously evaluate their capabilities in this domain. Method: We introduce TESTEVAL, the first standardized benchmark for LLM-based directed test generation, comprising 210 LeetCode Python programs and three tasks: overall coverage, targeted line/branch coverage, and targeted path coverage. The evaluation framework combines static program analysis with dynamic execution validation for fine-grained, logic-aware scoring. Contribution/Results: We conduct the first systematic evaluation of 16 state-of-the-art LLMs on TESTEVAL. Results reveal severe deficiencies on targeted path coverage, exposing fundamental weaknesses in LLMs' understanding of program execution semantics and control-flow logic. To foster reproducible and extensible research in AI-driven software testing, we publicly release the full dataset, evaluation toolchain, and experimental results.
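The targeted-path task can be made concrete with a small sketch (an illustration only, not the paper's released pipeline; the program under test and the path encoding here are assumptions): treat a path as the ordered sequence of line offsets executed inside a toy solution function, record it with Python's built-in `sys.settrace`, and say a generated test covers a target path only if its trace matches exactly.

```python
import sys

def sign(n):        # hypothetical program under test
    if n < 0:       # offset 1
        return -1   # offset 2
    if n == 0:      # offset 3
        return 0    # offset 4
    return 1        # offset 5

def execution_path(func, *args):
    """Return the ordered line offsets (0 = def line) executed in func."""
    code = func.__code__
    start = code.co_firstlineno
    path = []

    def tracer(frame, event, arg):
        # Record only 'line' events that occur inside the target function.
        if event == "line" and frame.f_code is code:
            path.append(frame.f_lineno - start)
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)  # always detach the tracer
    return path

def covers_path(func, args, target_path):
    """A test input covers a target path iff it induces exactly that trace."""
    return execution_path(func, *args) == target_path
```

Under this encoding, `sign(0)` yields the trace `[1, 3, 4]`, so `covers_path(sign, (0,), [1, 3, 4])` holds; the benchmark's path task asks the model to find such an input for a specified path.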

📝 Abstract
Testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program under test. Recently, researchers have recognized the potential of large language models (LLMs) in software testing. However, there remains a lack of fair comparisons between different LLMs in terms of test case generation capabilities. In this paper, we propose TESTEVAL, a novel benchmark for test case generation with LLMs. We collect 210 Python programs from an online programming platform, LeetCode, and design three different tasks: overall coverage, targeted line/branch coverage, and targeted path coverage. We further evaluate sixteen popular LLMs, including both commercial and open-source ones, on TESTEVAL. We find that generating test cases to cover specific program lines/branches/paths is still challenging for current LLMs, indicating a lack of ability to comprehend program logic and execution paths. We have open-sourced our dataset and benchmark pipelines at https://github.com/LLM4SoftwareTesting/TestEval.
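The targeted line-coverage task described in the abstract can likewise be sketched with the standard library (a minimal sketch under assumptions: a made-up LeetCode-style function and a `sys.settrace`-based check, not the benchmark code from the linked repository): run a candidate test input under a trace and report which lines of the function it actually executes.

```python
import sys

def classify(n):              # hypothetical LeetCode-style solution
    if n < 0:                 # offset 1
        return "negative"     # offset 2 -- suppose this is the target line
    if n == 0:                # offset 3
        return "zero"         # offset 4
    return "positive"         # offset 5

def executed_lines(func, *args):
    """Return the set of line offsets (0 = def line) that func executes."""
    code = func.__code__
    start = code.co_firstlineno
    hit = set()

    def tracer(frame, event, arg):
        # Collect 'line' events belonging to the target function's frame.
        if event == "line" and frame.f_code is code:
            hit.add(frame.f_lineno - start)
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)  # always detach the tracer
    return hit

def covers_line(func, args, target_offset):
    """A generated test covers the target line iff its offset is traced."""
    return target_offset in executed_lines(func, *args)
```

Here `covers_line(classify, (-5,), 2)` holds while `covers_line(classify, (7,), 2)` does not; the benchmark's line/branch tasks ask the LLM to produce inputs hitting such specified targets.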
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Software Testing
Code Coverage
Innovation

Methods, ideas, or system contributions that make the work stand out.

TESTEVAL
Language Model Evaluation
Software Testing