Beyond Strict Rules: Assessing the Effectiveness of Large Language Models for Code Smell Detection

📅 2026-01-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the limitations of traditional static analysis tools, whose rigid rules struggle to effectively detect diverse code smells, thereby compromising software maintainability. It presents the first systematic evaluation of large language models—including DeepSeek-R1, GPT-5 mini, Llama-3.3, and Qwen2.5-Code—in detecting nine categories of code smells across 30 real-world Java projects annotated by developers. The authors propose an adaptive hybrid strategy that dynamically combines large language models with static analysis, selecting the optimal detection method based on whether precision or recall is prioritized. Experimental results show that large language models excel at identifying structurally simple smells such as Large Class and Long Method. The hybrid approach achieves the highest F1 scores in five out of nine smell categories, significantly improving overall detection performance, though it still faces challenges with false positives in more complex code smells.
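The adaptive selection idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the detector names, smell categories, and all precision/recall numbers are made-up placeholders standing in for per-smell validation scores.

```python
# Sketch of an adaptive hybrid selector: for each code smell category, pick
# whichever detector (LLM or static analyzer) scores best on the metric the
# user prioritizes. All scores below are illustrative placeholders, not the
# paper's measurements.

# Hypothetical per-smell validation scores: (precision, recall)
SCORES = {
    "Large Class":  {"llm": (0.82, 0.90), "static": (0.88, 0.61)},
    "Long Method":  {"llm": (0.80, 0.93), "static": (0.91, 0.58)},
    "Feature Envy": {"llm": (0.45, 0.70), "static": (0.66, 0.40)},
}

def pick_detector(smell: str, priority: str = "recall") -> str:
    """Return 'llm' or 'static', whichever maximizes the prioritized
    metric; priority is 'precision' or 'recall'."""
    idx = 0 if priority == "precision" else 1
    candidates = SCORES[smell]
    return max(candidates, key=lambda name: candidates[name][idx])

print(pick_detector("Long Method", priority="recall"))     # -> llm
print(pick_detector("Long Method", priority="precision"))  # -> static
```

With recall prioritized, the higher-coverage LLM wins for these hypothetical scores; with precision prioritized, the stricter static rule wins, mirroring the paper's conclusion that the optimal strategy depends on which metric matters most.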

📝 Abstract
Code smells are symptoms of potential code quality problems that may affect software maintainability, thus increasing development costs and impacting software reliability. Large language models (LLMs) have shown remarkable capabilities for supporting various software engineering activities, but their use for detecting code smells remains underexplored. Unlike static analysis tools with their rigid rules, LLMs can support flexible and adaptable detection strategies tailored to the unique properties of code smells. This paper evaluates the effectiveness of four LLMs -- DeepSeek-R1, GPT-5 mini, Llama-3.3, and Qwen2.5-Code -- for detecting nine code smells across 30 Java projects. For the empirical evaluation, we created a ground-truth dataset by asking 76 developers to manually inspect 268 code-smell candidates. Our results indicate that LLMs perform strongly for structurally straightforward smells, such as Large Class and Long Method. However, we also observed that different LLMs and tools fare better for distinct code smells. We then propose and evaluate a detection strategy that combines LLMs and static analysis tools. The proposed strategy outperforms LLMs and tools in five out of nine code smells in terms of F1-Score. However, it also generates more false positives for complex smells. Therefore, we conclude that the optimal strategy depends on whether Recall or Precision is the main priority for code smell detection.
Problem

Research questions and friction points this paper is trying to address.

code smell detection
large language models
software maintainability
static analysis tools
empirical evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

large language models
code smell detection
static analysis
hybrid strategy
empirical evaluation