🤖 AI Summary
Existing unit test generation benchmarks are limited to function- or class-level code and fail to capture the complexity of real-world software projects. To address this gap, we introduce ProjectTest, the first cross-language (Python/Java/JavaScript), project-level benchmark, comprising 60 moderate-sized, high-quality open-source projects (20 per language). We establish a rigorous project-level test generation evaluation paradigm and systematically assess nine state-of-the-art large language models (LLMs) in zero-shot settings. Our analysis reveals systematic deficiencies in LLM-generated tests, particularly compilation failures and cascading errors. Empirical results show that both human-guided correction and model-based self-repair substantially improve test pass rates, yielding average gains exceeding 40%. This work fills a critical gap in project-level evaluation, uncovers key engineering bottlenecks in LLM-driven test generation, and empirically validates the performance gains achievable through targeted error-correction mechanisms.
📝 Abstract
Unit test generation has become a promising and important use case of LLMs. However, existing benchmarks for evaluating LLM unit test generation capabilities focus on function- or class-level code rather than the more practical and challenging project-level codebases. To address this limitation, we propose ProjectTest, a project-level benchmark for unit test generation covering Python, Java, and JavaScript. ProjectTest features 20 moderate-sized, high-quality projects per language. We evaluate nine frontier LLMs on ProjectTest, and the results show that all of them achieve only moderate performance on the Python and Java portions, highlighting the difficulty of the benchmark. We also conduct a thorough error analysis, which shows that even frontier LLMs such as Claude-3.5-Sonnet produce a significant number of simple errors, including compilation and cascade errors. Motivated by this observation, we further evaluate all frontier LLMs under manual error-fixing and self-error-fixing scenarios to assess their potential when equipped with error-fixing mechanisms.