🤖 AI Summary
Manual test cases frequently exhibit test smells (e.g., ambiguity, redundancy, and missing assertions) that undermine their reliability and maintainability; existing detection tools rely on handcrafted rules, which limits their scalability and adaptability.
Method: This paper pioneers the use of small language models (SLMs), including Gemma-3, Llama-3.2, and Phi-4, to automatically detect and explain seven common test smells and to generate actionable repair suggestions, without predefined rules, in an end-to-end, privacy-preserving, low-resource deployment setting. Evaluation is conducted on real-world Ubuntu test cases.
Contribution/Results: Experimental results show that Phi-4 achieves 97% pass@2 accuracy, significantly outperforming Gemma-3 and Llama-3.2 (both at roughly 91%), while autonomously identifying smells and producing human-interpretable feedback. The approach establishes a lightweight, explainable paradigm for test quality assurance.
📝 Abstract
Manual testing, in which testers follow natural language instructions to validate system behavior, remains crucial for uncovering issues not easily captured by automation. However, these test cases often suffer from test smells: quality issues such as ambiguity, redundancy, or missing checks that reduce test reliability and maintainability. While detection tools exist, they typically require manually defined rules and lack scalability. This study investigates the potential of Small Language Models (SLMs) for automatically detecting test smells. We evaluate Gemma-3, Llama-3.2, and Phi-4 on 143 real-world Ubuntu test cases covering seven types of test smells. Phi-4 achieved the best results, reaching a pass@2 of 97% in detecting sentences with test smells, while Gemma-3 and Llama-3.2 reached approximately 91%. Beyond detection, the SLMs autonomously explained issues and suggested improvements, even without explicit prompt instructions. They enabled low-cost, concept-driven identification of diverse test smells without relying on extensive rule definitions or syntactic analysis. These findings highlight the potential of SLMs as efficient tools that preserve data privacy and can improve test quality in real-world scenarios.
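The abstract does not spell out how pass@2 is computed; a common reading is the fraction of test cases for which the model identifies the smelly sentence correctly within its first two attempts. The sketch below assumes that simple per-case interpretation (the function and variable names, such as `pass_at_2` and `attempts_per_case`, are illustrative, not from the paper):

```python
def pass_at_2(attempts_per_case):
    """Compute pass@2 under the assumed per-case interpretation.

    attempts_per_case: list of per-case results, each a list of booleans
    (True = that attempt correctly identified the smelly sentence).
    Returns the fraction of cases solved within the first two attempts.
    """
    solved = sum(1 for attempts in attempts_per_case if any(attempts[:2]))
    return solved / len(attempts_per_case)

# Example: 4 cases, two model attempts each.
results = [
    [True, False],   # solved on first attempt
    [False, True],   # solved on second attempt
    [False, False],  # not solved
    [True, True],    # solved on first attempt
]
print(pass_at_2(results))  # 0.75
```

A stricter unbiased estimator (as used for pass@k in code-generation benchmarks) would sample more than two attempts per case; the simple form above suffices when exactly two attempts are made.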