Agentic SLMs: Hunting Down Test Smells

📅 2025-04-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Test smells undermine test reliability and maintainability. Existing approaches focus primarily on detection while neglecting automated repair, and often rely on heavyweight static analysis or large language models, resulting in high deployment barriers and poor generalizability. This paper proposes the first multi-agent workflow powered by small open-source language models (e.g., Phi-4 14B, Gemma 2 9B) for end-to-end detection and refactoring of test smells in Java, Python, and Go. New smell types can be introduced via natural-language specifications alone. The approach integrates lightweight static analysis with generative refactoring, enabling low-resource execution (a single consumer-grade GPU), high accuracy (96% detection pass@5, 75.3% refactoring pass@5), and strong cross-language generalization. The authors contributed 10 pull requests to real-world projects, 5 of which were merged, demonstrating practical engineering efficacy.

📝 Abstract
Test smells can compromise the reliability of test suites and hinder software maintenance. Although several strategies exist for detecting test smells, few address their removal. Traditional methods often rely on static analysis or machine learning, requiring significant effort and expertise. This study evaluates LLAMA 3.2 3B, GEMMA 2 9B, DEEPSEEK-R1 14B, and PHI 4 14B (small, open language models) for automating the detection and refactoring of test smells through agent-based workflows. We explore workflows with one, two, and four agents across 150 instances of 5 common test smell types extracted from real-world Java projects. Unlike prior approaches, ours is easily extensible to new smells via natural language definitions and generalizes to Python and Golang. All models detected nearly all test smell instances (pass@5 of 96% with four agents), with PHI 4 14B achieving the highest refactoring accuracy (pass@5 of 75.3%). Analyses were computationally inexpensive and ran efficiently on consumer-grade hardware. Notably, PHI 4 14B with four agents performed within 5% of proprietary models such as O1-MINI, O3-MINI-HIGH, and GEMINI 2.5 PRO EXPERIMENTAL using a single agent. Multi-agent setups outperformed single-agent ones in three out of five test smell types, highlighting their potential to improve software quality with minimal developer effort. For the Assertion Roulette smell, however, a single agent performed better. To assess practical relevance, we submitted 10 pull requests with PHI 4 14B-generated code to open-source projects. Five were merged, one was rejected, and four remain under review, demonstrating the approach's real-world applicability.
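The pass@5 figures quoted above mean that at least one of five sampled outputs succeeds. A common way to estimate pass@k from n generations of which c succeed is the unbiased combinatorial estimator popularized by code-generation benchmarks; the paper does not spell out its exact computation, so treat this as a reference sketch rather than the authors' implementation:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k: the probability that at least one of k
    samples drawn without replacement from n generations, of which c are
    correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k must
        # include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = k = 5 the estimator reduces to "did any of the five samples pass", matching the headline 96% detection and 75.3% refactoring numbers' interpretation.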
Problem

Research questions and friction points this paper is trying to address.

Automating detection and refactoring of test smells
Evaluating small open language models for agent-based workflows
Improving software quality with minimal developer effort
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agent-based workflows for test smell detection
Natural language definitions for extensibility
Multi-agent setups improve refactoring accuracy
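The Assertion Roulette smell singled out in the abstract can be illustrated with a toy example (ours, not drawn from the paper's benchmark): the smell is a test whose assertions carry no explanatory messages, so a failure does not say which expectation broke; the refactoring attaches a message to each assertion.

```python
import unittest

class SmellyTest(unittest.TestCase):
    """Assertion Roulette: several unexplained assertions in one test."""
    def test_user(self):
        user = {"name": "Ada", "age": 36}
        self.assertEqual(user["name"], "Ada")
        self.assertEqual(user["age"], 36)

class RefactoredTest(unittest.TestCase):
    """Refactored: each assertion states which expectation it checks."""
    def test_user(self):
        user = {"name": "Ada", "age": 36}
        self.assertEqual(user["name"], "Ada", "user name should be 'Ada'")
        self.assertEqual(user["age"], 36, "user age should be 36")
```

When `RefactoredTest` fails, the unittest report includes the message identifying the broken expectation, which is exactly the information a roulette-style test withholds.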