🤖 AI Summary
Existing evaluations of Java test-generation tools often rely on narrow, single-dimensional metrics, limiting insight into their practical utility. Method: The 2025 Java Unit Testing Competition benchmarks four tools (EVOFUZZ, EVOSUITE, BBC, and RANDOOP) on 55 Java classes drawn from six open-source projects, evaluating them along three dimensions: code coverage, mutation coverage, and the readability of the generated test cases, quantified via syntactic complexity and identifier naming quality. A standardized benchmark suite is constructed, and the evaluation combines static analysis, dynamic execution, and readability modeling. Contribution/Results: The results reveal trade-offs between coverage and maintainability: EVOFUZZ achieves +12.3% higher average mutation coverage than the baselines, while RANDOOP produces the most readable test cases. The comparison offers evidence-based guidance for test-tool selection and a more holistic methodology for assessing test generation.
📝 Abstract
This short report presents the 2025 edition of the Java Unit Testing Competition, in which four test-generation tools (EVOFUZZ, EVOSUITE, BBC, and RANDOOP) were benchmarked on a freshly selected set of 55 Java classes from six different open-source projects. The benchmarking was based on structural metrics, such as code and mutation coverage of the classes under test, as well as on the readability of the generated test cases.
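As a rough illustration of the two structural metrics mentioned above (not the competition's actual scoring pipeline), code coverage and mutation coverage are conventionally reported as simple ratios; the class and method names below are hypothetical:

```java
// Illustrative sketch, assuming the conventional definitions:
// code coverage  = covered code units / total code units,
// mutation score = killed mutants / generated mutants.
public class CoverageMetrics {

    // Fraction of code units (e.g. lines or branches) exercised by the tests.
    public static double codeCoverage(int covered, int total) {
        return total == 0 ? 0.0 : (double) covered / total;
    }

    // Fraction of seeded mutants detected ("killed") by the test suite.
    public static double mutationScore(int killed, int generated) {
        return generated == 0 ? 0.0 : (double) killed / generated;
    }

    public static void main(String[] args) {
        System.out.printf("coverage=%.2f%n", codeCoverage(80, 100));
        System.out.printf("mutation=%.2f%n", mutationScore(45, 60));
    }
}
```

A tool can score highly on one ratio and poorly on the other, which is why the competition reports both alongside readability.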