🤖 AI Summary
Existing code smell detection approaches, ranging from heuristic rules to machine learning (ML) and deep learning (DL) techniques, exhibit limited performance, while full fine-tuning of large language models (LLMs) incurs prohibitive computational costs. This work presents the first systematic evaluation of parameter-efficient fine-tuning (PEFT) techniques, namely Prompt Tuning, Prefix Tuning, LoRA, and (IA)³, for detecting method-level code smells (Complex Conditional and Complex Method). We conduct experiments across four small language models and six LLMs using high-quality datasets sourced from GitHub. Key findings: (1) small models combined with PEFT outperform both larger models with PEFT and fully fine-tuned LLMs; (2) training data scale exerts a far greater impact on detection accuracy than the number of tunable parameters; (3) PEFT achieves comparable or superior accuracy to full fine-tuning while drastically reducing GPU memory consumption, and consistently surpasses traditional heuristic-based and DL-based detectors. Together, these results establish a lightweight, efficient, and deployable paradigm for code quality analysis.
📝 Abstract
Code smells are suboptimal coding practices that negatively impact the quality of software systems. Existing detection methods, which rely on heuristics or Machine Learning (ML) and Deep Learning (DL) techniques, often suffer from unsatisfactory performance. Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a resource-efficient approach for adapting LLMs to specific tasks, but their effectiveness for method-level code smell detection remains underexplored. To address this gap, this study evaluates state-of-the-art PEFT methods on both small and large Language Models (LMs) for detecting two types of method-level code smells: Complex Conditional and Complex Method. Using high-quality datasets sourced from GitHub, we fine-tuned four small LMs and six LLMs with PEFT techniques, including prompt tuning, prefix tuning, LoRA, and (IA)³. Results show that PEFT methods achieve comparable or better performance than full fine-tuning while consuming less GPU memory. Notably, LLMs did not outperform small LMs, suggesting that smaller models are well suited to this task. Additionally, increasing the training dataset size significantly boosted performance, whereas increasing the number of trainable parameters did not. Our findings highlight PEFT methods as effective and scalable solutions that outperform existing heuristic-based and DL-based detectors.
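The GPU-memory savings reported for PEFT stem from its small trainable-parameter footprint. As a minimal sketch of why that footprint is small, the arithmetic below counts LoRA's trainable parameters for a single frozen weight matrix; the 4096-dimensional size and rank 8 are illustrative assumptions, not values taken from this study.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA freezes the pretrained weight W (d_out x d_in) and trains only
    two low-rank factors: A (rank x d_in) and B (d_out x rank), so the
    effective weight becomes W + B @ A."""
    return rank * d_in + d_out * rank

# Hypothetical projection matrix in a Transformer layer (illustrative sizes).
d_in = d_out = 4096

full = d_in * d_out                              # full fine-tuning updates every weight
lora = lora_trainable_params(d_in, d_out, rank=8)

print(full)                  # 16777216
print(lora)                  # 65536
print(f"{lora / full:.4%}")  # LoRA trains roughly 0.39% of this matrix's parameters
```

Because only the low-rank factors require gradients and optimizer state, memory for gradients and Adam moments shrinks proportionally, which is consistent with the reduced GPU consumption the study observes.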