TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark

📅 2024-10-01
🏛️ arXiv.org
📈 Citations: 9
Influential: 1
🤖 AI Summary
Existing benchmarks for code-generation LLMs rarely target software testing, limiting rigorous evaluation of LLMs as test writers. To address this, the authors introduce TestGenEval, a large-scale benchmark for realistic Python unit test generation and completion. Built on SWEBench, it comprises 68,647 tests from 1,210 code/test file pairs across 11 well-maintained open-source repositories, and covers three tasks: writing a test suite from scratch, completing an existing test suite, and improving code coverage. Evaluation of models ranging from 7B to 405B parameters exposes fundamental limitations of current LLMs, particularly in reasoning about execution and in generating correct assertions for complex code paths; the best model, GPT-4o, achieves an average coverage of only 35.2%.

📝 Abstract
Code generation models can help improve many common software tasks ranging from code completion to defect prediction. Most existing benchmarks for code generation LLMs focus on code authoring or code completion. Surprisingly, far less effort has been dedicated to benchmarking software testing, despite the strong correlation between well-tested software and effective bug detection. To address this gap, we create and release TestGenEval, a large-scale benchmark to measure test generation performance. Based on SWEBench, TestGenEval comprises 68,647 tests from 1,210 code and test file pairs across 11 well-maintained Python repositories. It covers initial test authoring, test suite completion, and code coverage improvement. Test authoring simulates the process of a developer writing a test suite from scratch, while test completion mimics the scenario where a developer aims to improve the coverage of an existing test suite. We evaluate several popular models, with sizes ranging from 7B to 405B parameters. Our detailed analysis highlights TestGenEval's contribution to a comprehensive evaluation of test generation performance. In particular, models struggle to generate high-coverage test suites, with the best model, GPT-4o, achieving an average coverage of only 35.2%. This is primarily due to models struggling to reason about execution, and their frequent assertion errors when addressing complex code paths.
Problem

Research questions and friction points this paper is trying to address.

Few benchmarks measure the software testing abilities of code generation models.
Need to evaluate test generation performance on real-world code.
Existing models struggle to generate high-coverage test suites.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale benchmark for test generation.
Covers test authoring and completion scenarios.
Evaluates models on code coverage improvements.