🤖 AI Summary
This study addresses the understudied issue of test smells in unit tests generated by large language models (LLMs). Method: We conduct the first large-scale empirical analysis of test smells across 20,505 Java class-level test suites drawn from five sources: human-written tests, EvoSuite-generated tests, and LLM-generated tests from GPT-3.5, GPT-4, Mistral, and Mixtral. Our multi-benchmark, cross-model analysis framework covers over 770,000 test cases, applying two complementary smell-detection tools (TsDetect and JNose) across 34,635 open-source projects and the TestBench benchmark. Contribution/Results: We identify prevalent smells in LLM-generated tests, including Assertion Roulette and Magic Number Test, whose occurrence patterns are significantly influenced by prompting strategy, context length, and model scale. Notably, LLM-generated tests exhibit smell profiles closer to human-written tests than to search-based software testing (SBST) outputs, suggesting potential training-data contamination. These findings provide critical empirical grounding for developing smell-aware test generation frameworks.
📝 Abstract
LLMs promise to transform unit test generation from a manual burden into an automated process. Yet, beyond metrics such as compilability and coverage, little is known about the quality of LLM-generated tests, particularly their susceptibility to test smells: design flaws that undermine readability and maintainability. This paper presents the first multi-benchmark, large-scale analysis of test smell diffusion in LLM-generated unit tests. We contrast LLM outputs with human-written suites (the reference for real-world practice) and SBST-generated tests from EvoSuite (the automated baseline), disentangling whether LLMs reproduce human-like flaws or artifacts of synthetic generation. Our study draws on 20,505 class-level suites from four LLMs (GPT-3.5, GPT-4, Mistral 7B, Mixtral 8x7B), 972 method-level cases from TestBench, 14,469 EvoSuite tests, and 779,585 human-written tests from 34,635 open-source Java projects. Using two complementary detection tools (TsDetect and JNose), we analyze smell prevalence, co-occurrence, and correlations with software attributes and generation parameters. Results show that LLM-generated tests consistently manifest smells such as Assertion Roulette and Magic Number Test, with patterns strongly influenced by prompting strategy, context length, and model scale. Comparisons reveal substantial overlap with human-written tests, raising concerns of potential data leakage from training corpora, while EvoSuite exhibits distinct, generator-specific flaws. These findings highlight both the promise and the risks of LLM-based test generation, and call for smell-aware generation frameworks, prompt engineering strategies, and enhanced detection tools to ensure maintainable, high-quality test code.
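For readers unfamiliar with the two smells named above, here is a minimal illustrative sketch (not an example from the paper's dataset). It mimics a JUnit-style test with a tiny hand-rolled `assertEquals` helper so the file is self-contained; the method and class names are hypothetical.

```java
// Illustrative sketch of two common test smells measured in the study.
// A small assertEquals helper stands in for JUnit so this file compiles
// and runs on its own.
public class Main {
    static void assertEquals(int expected, int actual) {
        if (expected != actual)
            throw new AssertionError("expected " + expected + " but got " + actual);
    }

    // Assertion Roulette: several assertions with no explanatory message,
    // so when one fails it is unclear which check broke.
    // Magic Number Test: unexplained numeric literals (86400, 3600) used
    // directly in assertions instead of named constants.
    static void testSecondsConversion() {
        assertEquals(86400, 24 * 60 * 60); // magic number, no failure message
        assertEquals(3600, 60 * 60);       // magic number, no failure message
        assertEquals(60, 1 * 60);          // if this fails, which line? unclear
    }

    public static void main(String[] args) {
        testSecondsConversion();
        System.out.println("all assertions passed");
    }
}
```

A smell-free version would replace the literals with named constants (e.g., `SECONDS_PER_DAY`) and pass a descriptive message to each assertion, which is exactly the kind of fix a smell-aware generation framework would aim to produce.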