🤖 AI Summary
This study addresses the understudied issue of test smells in unit tests generated by large language models (LLMs). Method: We conduct the first large-scale empirical analysis of test smells across 20,505 Java class-level test suites drawn from five sources: human-written tests, EvoSuite-generated tests, and LLM-generated tests from GPT-3.5, GPT-4, Mistral, and Mixtral. Our multi-benchmark, cross-model analysis framework covers over 770,000 test cases, applying two complementary smell-detection tools (TsDetect and JNose) across 34,635 open-source projects and the TestBench benchmark. Contribution/Results: We identify prevalent smells in LLM-generated tests, including Assertion Roulette and Magic Number Test, whose occurrence patterns are significantly influenced by prompting strategy, context length, and model scale. Notably, LLM-generated tests exhibit smell profiles closer to human-written tests than to search-based software testing (SBST) outputs, suggesting potential training-data contamination. These findings provide critical empirical grounding for developing smell-aware test generation frameworks.
📝 Abstract
LLMs promise to transform unit test generation from a manual burden into an automated process. Yet, beyond metrics such as compilability and coverage, little is known about the quality of LLM-generated tests, particularly their susceptibility to test smells: design flaws that undermine readability and maintainability. This paper presents the first multi-benchmark, large-scale analysis of test smell diffusion in LLM-generated unit tests. We contrast LLM outputs with human-written suites (the reference for real-world practice) and SBST-generated tests from EvoSuite (the automated baseline), disentangling whether LLMs reproduce human-like flaws or artifacts of synthetic generation. Our study draws on 20,505 class-level suites from four LLMs (GPT-3.5, GPT-4, Mistral 7B, Mixtral 8x7B), 972 method-level cases from TestBench, 14,469 EvoSuite tests, and 779,585 human-written tests from 34,635 open-source Java projects. Using two complementary detection tools (TsDetect and JNose), we analyze smell prevalence, co-occurrence, and correlations with software attributes and generation parameters. Results show that LLM-generated tests consistently manifest smells such as Assertion Roulette and Magic Number Test, with patterns strongly influenced by prompting strategy, context length, and model scale. Comparisons reveal substantial overlap with human-written tests, raising concerns of potential data leakage from training corpora, while EvoSuite exhibits distinct, generator-specific flaws. These findings highlight both the promise and the risks of LLM-based test generation, and call for smell-aware generation frameworks, prompt engineering strategies, and enhanced detection tools to ensure maintainable, high-quality test code.
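For readers unfamiliar with the two smells named above, here is a minimal illustrative sketch (not an example from the paper's dataset). It mimics a JUnit-style test with a tiny hand-rolled `assertEquals` helper so the file is self-contained; the method and class names are hypothetical.

```java
// Illustrative sketch of two common test smells measured in the study.
// A small assertEquals helper stands in for JUnit so this file compiles
// and runs on its own.
public class Main {
    static void assertEquals(int expected, int actual) {
        if (expected != actual)
            throw new AssertionError("expected " + expected + " but got " + actual);
    }

    // Assertion Roulette: several assertions with no explanatory message,
    // so when one fails it is unclear which check broke.
    // Magic Number Test: unexplained numeric literals (86400, 3600) used
    // directly in assertions instead of named constants.
    static void testSecondsConversion() {
        assertEquals(86400, 24 * 60 * 60); // magic number, no failure message
        assertEquals(3600, 60 * 60);       // magic number, no failure message
        assertEquals(60, 1 * 60);          // if this fails, which line? unclear
    }

    public static void main(String[] args) {
        testSecondsConversion();
        System.out.println("all assertions passed");
    }
}
```

A smell-free version would replace the literals with named constants (e.g., `SECONDS_PER_DAY`) and pass a descriptive message to each assertion, which is exactly the kind of fix a smell-aware generation framework would aim to produce.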