🤖 AI Summary
This study systematically evaluates the effectiveness of large language models (LLMs) in automated Java unit test generation, comparing LLM-based TestSpark against search-based (EvoSuite) and symbolic execution–based (Kex) approaches on the GitBug dataset using multidimensional metrics: code coverage, mutation score, defect detection rate, and code characteristics. Methodologically, it conducts controlled empirical analysis across diverse unit-under-test (CUT) sizes and complexities. Key contributions include: (1) first empirical evidence that while LLM-generated tests achieve lower line/branch coverage, they attain significantly higher mutation scores—indicating superior semantic understanding and logical reasoning; (2) identification of CUT size and complexity as the dominant performance bottleneck for LLMs, far exceeding sensitivity observed in traditional tools; and (3) confirmation that all three approaches are constrained by internal CUT dependencies, yet LLMs uniquely excel in generating semantically rich, high-quality test cases—providing critical empirical grounding and actionable directions for LLM-driven testing research and practice.
📝 Abstract
Generating tests automatically is a key and ongoing area of focus in software engineering research. The emergence of Large Language Models (LLMs) has opened up new opportunities, given their ability to perform a wide spectrum of tasks. However, the effectiveness of LLM-based approaches compared to traditional techniques such as search-based software testing (SBST) and symbolic execution remains uncertain. In this paper, we perform an extensive study of automatic test generation approaches based on three tools: EvoSuite for SBST, Kex for symbolic execution, and TestSpark for LLM-based test generation. We evaluate tools performance on the GitBug Java dataset and compare them using various execution-based and feature-based metrics. Our results show that while LLM-based test generation is promising, it falls behind traditional methods in terms of coverage. However, it significantly outperforms them in mutation scores, suggesting that LLMs provide a deeper semantic understanding of code. LLM-based approach also performed worse than SBST and symbolic execution-based approaches w.r.t. fault detection capabilities. Additionally, our feature-based analysis shows that all tools are primarily affected by the complexity and internal dependencies of the class under test (CUT), with LLM-based approaches being especially sensitive to the CUT size.