🤖 AI Summary
Existing unit test generation benchmarks are limited to function- or class-level code and fail to capture the complexity of real-world software projects. To address this gap, we introduce ProjectTest, the first cross-language (Python/Java/JavaScript), project-level benchmark, comprising 60 moderate-sized, high-quality open-source projects (20 per language). We establish a rigorous project-level test generation evaluation paradigm and systematically assess nine state-of-the-art large language models (LLMs) in zero-shot settings. Our analysis reveals systematic deficiencies in LLM-generated tests, particularly compilation failures and cascading errors. Empirical results show that both human-guided correction and model-based self-repair substantially improve test pass rates, yielding average gains exceeding 40%. This work fills a critical gap in project-level evaluation, uncovers key engineering bottlenecks in LLM-driven test generation, and empirically validates the performance gains achievable through targeted error-correction mechanisms.
📝 Abstract
Unit test generation has become a promising and important use case of LLMs. However, existing benchmarks for evaluating LLM unit test generation capabilities focus on function- or class-level code rather than the more practical and challenging project-level codebases. To address this limitation, we propose ProjectTest, a project-level benchmark for unit test generation covering Python, Java, and JavaScript. ProjectTest features 20 moderate-sized, high-quality projects per language. We evaluate nine frontier LLMs on ProjectTest, and the results show that all of them achieve only moderate performance on the Python and Java portions, highlighting the difficulty of the benchmark. We also conduct a thorough error analysis, which shows that even frontier LLMs such as Claude-3.5-Sonnet produce a significant number of simple errors, including compilation and cascade errors. Motivated by this observation, we further evaluate all frontier LLMs under manual error-fixing and self-error-fixing scenarios to assess their potential when equipped with error-fixing mechanisms.