🤖 AI Summary
This study presents the first systematic investigation into how code smells (structural deficiencies in source code) affect the code generation performance of large language models (LLMs), highlighting a critical oversight in prior work: the neglect of training data quality. We propose a fine-grained code smell detection framework coupled with an LLM-driven automated refactoring pipeline to construct a high-quality, smell-free training dataset, and implement these techniques in the open-source tool SmellCC. Using DeepSeek-V2 and Qwen-Coder, we conduct controlled fine-tuning and evaluation on code completion and code search tasks. Results demonstrate that models fine-tuned on cleaned data achieve significant improvements in the logical correctness, maintainability, and readability of generated code. Our core contribution is establishing a causal link between input data quality and model output quality, thereby introducing a novel paradigm for data governance in code LLMs and providing a reusable, empirically validated technical pathway for dataset curation.
📝 Abstract
Large Language Models (LLMs) have demonstrated great potential in code-related tasks. However, most research focuses on improving the output quality of LLMs (e.g., correctness), and less attention has been paid to the LLM input (e.g., the quality of the training code). Given that code smells widely exist in practice and can negatively impact software maintainability and readability, this study presents the first systematic research to assess and improve dataset quality in terms of code smells. In this work, we first conduct a preliminary study to explore the presence of code smells in a popular benchmark dataset (i.e., CodeSearchNet-Python) and evaluate the output of several popular LLMs (i.e., DeepSeek-Coder, CodeLlama, and MagiCoder), revealing that code smell issues extensively exist in both LLM input (e.g., benchmark datasets) and output (e.g., generated code). We then conduct our systematic research in three main steps. First, we propose an LLM-based code smell cleaning tool, named SmellCC, which automatically refactors and removes code smells; to evaluate the correctness of the refactoring, we construct a test set of 50 repositories sourced from the CodeSearchNet-Python benchmark for functional testing. Second, we apply our curated smell-cleaned dataset to fine-tune two LLMs (i.e., DeepSeek-V2 and Qwen-Coder) to explore their potential for generating high-quality code. Third, we investigate the impact of code smells on two downstream tasks: code completion and code search. Finally, we derive several actionable implications for software engineering researchers and industry practitioners from our findings.
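To make the idea of smell cleaning concrete, here is a minimal, hypothetical illustration (not taken from the paper or from SmellCC itself) of the kind of behavior-preserving refactoring such a pipeline would apply: a snippet with "magic number" and duplicated-expression smells, rewritten with named constants and a helper. The functional test at the end mirrors the paper's idea of verifying that refactoring preserves behavior.

```python
# Hypothetical before/after pair illustrating smell-cleaning.
# Names and values are invented for this sketch.

# Smelly version: magic numbers (100, 0.9, 5.0) and a duplicated formula.
def price_smelly(amount):
    if amount > 100:
        return amount * 0.9 + 5.0
    return amount * 1.0 + 5.0

# Refactored version: named constants and a single shipping helper.
BULK_THRESHOLD = 100
BULK_DISCOUNT = 0.9
SHIPPING_FEE = 5.0

def _with_shipping(subtotal: float) -> float:
    return subtotal + SHIPPING_FEE

def price_refactored(amount: float) -> float:
    discount = BULK_DISCOUNT if amount > BULK_THRESHOLD else 1.0
    return _with_shipping(amount * discount)

# Functional check: the refactoring must not change observable behavior.
assert price_smelly(200) == price_refactored(200)
assert price_smelly(50) == price_refactored(50)
```

An actual pipeline would apply transformations like this at scale, using detected smell locations to prompt an LLM and then running tests to reject refactorings that change behavior.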