Quality Assessment of Python Tests Generated by Large Language Models

📅 2025-06-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the quality of Python test code generated by GPT-4o, Amazon Q, and Llama 3.3, focusing on structural reliability, error types (e.g., assertion failures), and test smells (e.g., lack of cohesion among test cases), while investigating their association with design pattern misuse. We propose the first empirically grounded test quality assessment framework, integrating error classification, test smell detection, design pattern consistency verification, and cross-model/cross-prompting comparison (Text2Code vs. Code2Code). Our analysis reveals a strong positive correlation between errors and test smells (r = 0.82)—a novel empirical finding. We further identify a trade-off: increased prompt specificity improves correctness but exacerbates test smell prevalence. Results show GPT-4o achieves the lowest error rates (6% in Text2Code, 10% in Code2Code); assertion failures constitute 64% of all errors, and “Lack of Cohesion of Test Cases” is the most prevalent test smell (41%).

📝 Abstract
The manual generation of test scripts is a time-intensive, costly, and error-prone process, indicating the value of automated solutions. Large Language Models (LLMs) have shown great promise in this domain, leveraging their extensive knowledge to produce test code more efficiently. This study investigates the quality of Python test code generated by three LLMs: GPT-4o, Amazon Q, and Llama 3.3. We evaluate the structural reliability of test suites generated under two distinct prompt contexts: Text2Code (T2C) and Code2Code (C2C). Our analysis includes the identification of errors and test smells, with a focus on correlating these issues to inadequate design patterns. Our findings reveal that most test suites generated by the LLMs contained at least one error or test smell. Assertion errors were the most common, comprising 64% of all identified errors, while the test smell Lack of Cohesion of Test Cases was the most frequently detected (41%). Prompt context significantly influenced test quality; textual prompts with detailed instructions often yielded tests with fewer errors but a higher incidence of test smells. Among the evaluated LLMs, GPT-4o produced the fewest errors in both contexts (10% in C2C and 6% in T2C), whereas Amazon Q had the highest error rates (19% in C2C and 28% in T2C). For test smells, Amazon Q had fewer detections in the C2C context (9%), while Llama 3.3 performed best in the T2C context (10%). Additionally, we observed a strong relationship between specific errors, such as assertion or indentation issues, and test case cohesion smells. These findings demonstrate opportunities for improving the quality of test generation by LLMs and highlight the need for future research to explore optimized generation scenarios and better prompt engineering strategies.
Problem

Research questions and friction points this paper is trying to address.

Evaluating Python test quality from LLMs like GPT-4o, Amazon Q, Llama 3.3
Analyzing errors and test smells linked to poor design patterns
Assessing prompt context impact on test reliability and cohesion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates Python test quality from GPT-4o, Amazon Q, Llama 3.3
Analyzes errors and smells in Text2Code and Code2Code prompts
Identifies assertion errors and cohesion smells as major issues
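To make the two headline findings concrete, here is a minimal illustrative sketch (not taken from the paper's dataset) of the "Lack of Cohesion of Test Cases" smell in plain `unittest`: a single test class exercises unrelated behaviors, so its tests share no common fixture or subject under test, followed by a cohesive alternative.

```python
import unittest

# Smelly: one class mixes tests for unrelated behaviors
# (string-to-int parsing and list reversal), so the test
# cases lack cohesion -- the smell the paper found most
# prevalent (41%) in LLM-generated suites.
class TestEverything(unittest.TestCase):
    def test_parse_int(self):
        self.assertEqual(int("42"), 42)

    def test_list_reversal(self):
        self.assertEqual(list(reversed([1, 2, 3])), [3, 2, 1])

# Cohesive alternative: each class targets one unit of
# behavior, so related assertions live together.
class TestIntParsing(unittest.TestCase):
    def test_parse_positive(self):
        self.assertEqual(int("42"), 42)

    def test_parse_negative(self):
        self.assertEqual(int("-7"), -7)
```

An assertion error, the paper's most common error type (64%), would surface here as a failing `assertEqual` when the generated test encodes a wrong expected value.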
V. Alves
Federal University of Ceará (UFC), Quixadá, Ceará, Brazil
C. Bezerra
Federal University of Ceará (UFC), Quixadá, Ceará, Brazil
Ivan Machado
Adjunct Professor, Institute of Computing, Federal University of Bahia (IC-UFBA)
Software Engineering, Software Product Lines, Software Testing
Larissa Rocha
University of the State of Bahia (UNEB), Salvador, Bahia, Brazil
Tássio Virgínio
Federal University of Bahia (UFBA), Salvador, Bahia, Brazil
Publio Silva
Master's Degree Student in Computer Science at the Federal University of Ceará (UFC)
Software Engineering, Code Quality