🤖 AI Summary
Manual test cases frequently exhibit test smells (e.g., ambiguity, redundancy, and missing assertions) that undermine their reliability and maintainability; existing detection tools rely on handcrafted rules, which limits their scalability and adaptability.
Method: This paper pioneers the use of small language models (SLMs), including Gemma-3, Llama-3.2, and Phi-4, to automatically detect and explain seven common test smells and to generate actionable repair suggestions, without predefined rules, in an end-to-end, privacy-preserving, low-resource deployment setting. Evaluation is conducted on real-world Ubuntu test cases.
Contribution/Results: Experimental results show that Phi-4 achieves 97% pass@2 accuracy, significantly outperforming Gemma-3 and Llama-3.2 (both at roughly 91%), while autonomously identifying smells and producing human-interpretable feedback. The approach establishes a lightweight, explainable paradigm for test quality assurance.
📝 Abstract
Manual testing, in which testers follow natural language instructions to validate system behavior, remains crucial for uncovering issues not easily captured by automation. However, these test cases often suffer from test smells: quality issues such as ambiguity, redundancy, or missing checks that reduce test reliability and maintainability. While detection tools exist, they typically require manually defined rules and lack scalability. This study investigates the potential of Small Language Models (SLMs) for automatically detecting test smells. We evaluate Gemma-3, Llama-3.2, and Phi-4 on 143 real-world Ubuntu test cases covering seven types of test smells. Phi-4 achieved the best results, reaching a pass@2 of 97% in detecting sentences with test smells, while Gemma-3 and Llama-3.2 reached approximately 91%. Beyond detection, the SLMs autonomously explained issues and suggested improvements, even without explicit prompt instructions. They enabled low-cost, concept-driven identification of diverse test smells without relying on extensive rule definitions or syntactic analysis. These findings highlight the potential of SLMs as efficient tools that preserve data privacy and can improve test quality in real-world scenarios.
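The abstract does not spell out how pass@2 is computed; a common reading is the fraction of test cases for which the model identifies the smelly sentence correctly within its first two attempts. The sketch below assumes that simple per-case interpretation (the function and variable names, such as `pass_at_2` and `attempts_per_case`, are illustrative, not from the paper):

```python
def pass_at_2(attempts_per_case):
    """Compute pass@2 under the assumed per-case interpretation.

    attempts_per_case: list of per-case results, each a list of booleans
    (True = that attempt correctly identified the smelly sentence).
    Returns the fraction of cases solved within the first two attempts.
    """
    solved = sum(1 for attempts in attempts_per_case if any(attempts[:2]))
    return solved / len(attempts_per_case)

# Example: 4 cases, two model attempts each.
results = [
    [True, False],   # solved on first attempt
    [False, True],   # solved on second attempt
    [False, False],  # not solved
    [True, True],    # solved on first attempt
]
print(pass_at_2(results))  # 0.75
```

A stricter unbiased estimator (as used for pass@k in code-generation benchmarks) would sample more than two attempts per case; the simple form above suffices when exactly two attempts are made.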