Investigating the Performance of Small Language Models in Detecting Test Smells in Manual Test Cases

📅 2025-07-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Manual test cases frequently exhibit test smells—such as ambiguity, redundancy, and missing assertions—that undermine their reliability and maintainability; existing detection tools rely on handcrafted rules, limiting scalability and adaptability. Method: This paper pioneers the application of small language models (SLMs)—including Gemma-3, Llama-3.2, and Phi-4—to automatically detect and explain seven common test smells and generate actionable repair suggestions, without predefined rules, in an end-to-end, privacy-preserving, and low-resource deployment setting. Evaluation is conducted on real-world Ubuntu test cases. Contribution/Results: Experimental results demonstrate that Phi-4 achieves a 97% pass@2 accuracy—significantly outperforming Gemma-3 and Llama-3.2 (both at 91%)—while autonomously identifying smells and producing human-interpretable feedback. Our approach establishes a novel, lightweight, and explainable paradigm for test quality assurance.
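The rule-free detection setup described above can be pictured as a simple prompt-and-parse loop. A minimal sketch follows; the prompt wording, the `query_slm` callable, and the `SMELL:` response convention are illustrative assumptions, not the paper's actual prompts or tooling.

```python
def build_prompt(test_case: str) -> str:
    # Hypothetical prompt shape; the paper's exact instructions are not reproduced here.
    return (
        "Review the following manual test case. For each sentence, state whether it "
        "exhibits a test smell (e.g. ambiguity, redundancy, or a missing check), "
        "explain why, and suggest a fix. Prefix flagged sentences with 'SMELL:'.\n\n"
        f"Test case:\n{test_case}"
    )

def detect_smells(test_case: str, query_slm, attempts: int = 2):
    """Query a local SLM up to `attempts` times (mirroring pass@2 scoring) and
    return the lines the model flagged, or an empty list if none were flagged."""
    for _ in range(attempts):
        response = query_slm(build_prompt(test_case))
        flagged = [line for line in response.splitlines()
                   if line.startswith("SMELL:")]
        if flagged:
            return flagged
    return []
```

Here `query_slm` stands in for whatever local inference call serves Gemma-3, Llama-3.2, or Phi-4; keeping the model on-device is what makes the setting privacy-preserving and low-resource.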

📝 Abstract
Manual testing, in which testers follow natural language instructions to validate system behavior, remains crucial for uncovering issues not easily captured by automation. However, these test cases often suffer from test smells: quality issues such as ambiguity, redundancy, or missing checks that reduce test reliability and maintainability. While detection tools exist, they typically require manual rule definition and lack scalability. This study investigates the potential of Small Language Models (SLMs) for automatically detecting test smells. We evaluate Gemma3, Llama3.2, and Phi-4 on 143 real-world Ubuntu test cases, covering seven types of test smells. Phi-4 achieved the best results, reaching a pass@2 of 97% in detecting sentences with test smells, while Gemma3 and Llama3.2 reached approximately 91%. Beyond detection, SLMs autonomously explained issues and suggested improvements, even without explicit prompt instructions. They enabled low-cost, concept-driven identification of diverse test smells without relying on extensive rule definitions or syntactic analysis. These findings highlight the potential of SLMs as efficient tools that preserve data privacy and can improve test quality in real-world scenarios.
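For reference, the pass@2 score used above counts a test case as handled correctly if at least one of two model attempts succeeds. A minimal way to compute it, assuming per-attempt correctness labels are available (the labeling itself is outside this sketch):

```python
def pass_at_k(attempt_results, k=2):
    """attempt_results: one list of booleans per test case, in attempt order,
    where True means that attempt correctly detected the smelly sentences.
    A case passes if any of its first k attempts succeeded."""
    passed = sum(1 for case in attempt_results if any(case[:k]))
    return passed / len(attempt_results)
```

Under this convention, Phi-4's reported 97% corresponds to roughly 139 of the 143 Ubuntu test cases passing within two attempts.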
Problem

Research questions and friction points this paper is trying to address.

Detecting test smells in manual test cases automatically
Evaluating Small Language Models for test smell detection
Improving test reliability and maintainability without manual rules
Innovation

Methods, ideas, or system contributions that make the work stand out.

SLMs detect test smells automatically
SLMs explain issues and suggest improvements
SLMs enable low-cost concept-driven identification