Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An Empirical Study

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study systematically evaluates the effectiveness of large language models (LLMs) in end-to-end detection and automated repair of test smells. To address a limitation of existing tools, which emphasize detection over safe, semantics-preserving refactoring, we propose a hybrid framework that integrates static smell detectors (PyNose/TsDetect) with LLM-driven refactoring. We conduct the first empirical, cross-language comparison of GPT-4-Turbo, LLaMA 3 70B, and Gemini-1.5 Pro on Python and Java test code. Results show that Gemini-1.5 Pro achieves the highest detection accuracy (74.35% for Python; 80.32% for Java) and is the only model that significantly improves test coverage post-refactoring. The other models generate syntactically valid refactorings but frequently introduce new smells or reduce coverage. This work reveals critical disparities in LLM capabilities for test quality assurance and establishes the first cross-language, empirically validated benchmark for LLM-augmented test maintenance, providing actionable insights for practitioners and researchers alike.

📝 Abstract
Test smells indicate poor development practices in test code, reducing maintainability and reliability. While developers often struggle to prevent or refactor these issues, existing tools focus primarily on detection rather than automated refactoring. Large Language Models (LLMs) have shown strong potential in code understanding and transformation, but their ability to both identify and refactor test smells remains underexplored. We evaluated GPT-4-Turbo, LLaMA 3 70B, and Gemini-1.5 Pro on Python and Java test suites, using PyNose and TsDetect for initial smell detection, followed by LLM-driven refactoring. Gemini achieved the highest detection accuracy (74.35% Python, 80.32% Java), while LLaMA was lowest. All models could refactor smells, but effectiveness varied, sometimes introducing new smells. Gemini also improved test coverage, unlike GPT-4 and LLaMA, which often reduced it. These results highlight LLMs' potential for automated test smell refactoring, with Gemini as the strongest performer, though challenges remain across languages and smell types.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to detect and refactor test smells
Comparing effectiveness of GPT-4, LLaMA 3, and Gemini in test smell correction
Assessing impact of LLM refactoring on test coverage and new smells
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs detect and refactor test smells automatically
Gemini achieves highest accuracy in smell detection
Gemini improves test coverage during refactoring
Enio G. Santana
UFBA
Jander Pereira Santos Junior
UFBA
Erlon P. Almeida
UFBA
Iftekhar Ahmed
Associate Professor, University of California, Irvine
Software Engineering, Software Testing, Machine Learning
Paulo Anselmo da Mota Silveira Neto
UFRPE
Eduardo Santana de Almeida
Senior Member, IEEE