🤖 AI Summary
This study presents the first systematic investigation into how code smells (structural deficiencies in source code) affect the code generation performance of large language models (LLMs), highlighting a critical oversight in prior work: the neglect of training data quality. We propose a fine-grained code smell detection framework coupled with an LLM-driven automated refactoring pipeline to construct a high-quality, smell-free training dataset, and implement these techniques in the open-source tool SmellCC. Using DeepSeek-V2 and Qwen-Coder, we conduct controlled fine-tuning and evaluation on code completion and code search tasks. Results demonstrate that models fine-tuned on cleaned data achieve significant improvements in the logical correctness, maintainability, and readability of generated code. Our core contribution is establishing a causal link between input data quality and model output quality, thereby introducing a novel paradigm for data governance in code LLMs and providing a reusable, empirically validated technical pathway for dataset curation.
📝 Abstract
Large Language Models (LLMs) have demonstrated great potential in code-related tasks. However, most research focuses on improving the output quality of LLMs (e.g., correctness), and less attention has been paid to the LLM input (e.g., the quality of the training code). Given that code smells widely exist in practice and can negatively impact software maintainability and readability, this study presents the first systematic research to assess and improve dataset quality in terms of code smells. In this work, we first conduct a preliminary study to explore the presence of code smells in a popular benchmark dataset (i.e., CodeSearchNet-Python) and evaluate the output of several popular LLMs (i.e., DeepSeek-Coder, CodeLlama, and MagiCoder), revealing that code smell issues extensively exist in both LLM input (e.g., benchmark datasets) and output (e.g., generated code). We then conduct our systematic research in three main steps. First, we propose an LLM-based code smell cleaning tool, named SmellCC, which automatically refactors and removes code smells; to evaluate the correctness of the refactoring, we construct a test set of 50 repositories sourced from the CodeSearchNet-Python benchmark for functional testing. Second, we apply our curated smell-cleaned dataset to fine-tune two LLMs (i.e., DeepSeek-V2 and Qwen-Coder) to explore their potential for generating high-quality code. Third, we investigate the impact of code smells on two downstream tasks: code completion and code search. Finally, we derive several actionable implications for software engineering researchers and industry practitioners from our findings.
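To make the idea of smell cleaning concrete, here is a minimal, hypothetical illustration (not taken from the paper or from SmellCC itself) of the kind of behavior-preserving refactoring such a pipeline would apply: a snippet with "magic number" and duplicated-expression smells, rewritten with named constants and a helper. The functional test at the end mirrors the paper's idea of verifying that refactoring preserves behavior.

```python
# Hypothetical before/after pair illustrating smell-cleaning.
# Names and values are invented for this sketch.

# Smelly version: magic numbers (100, 0.9, 5.0) and a duplicated formula.
def price_smelly(amount):
    if amount > 100:
        return amount * 0.9 + 5.0
    return amount * 1.0 + 5.0

# Refactored version: named constants and a single shipping helper.
BULK_THRESHOLD = 100
BULK_DISCOUNT = 0.9
SHIPPING_FEE = 5.0

def _with_shipping(subtotal: float) -> float:
    return subtotal + SHIPPING_FEE

def price_refactored(amount: float) -> float:
    discount = BULK_DISCOUNT if amount > BULK_THRESHOLD else 1.0
    return _with_shipping(amount * discount)

# Functional check: the refactoring must not change observable behavior.
assert price_smelly(200) == price_refactored(200)
assert price_smelly(50) == price_refactored(50)
```

An actual pipeline would apply transformations like this at scale, using detected smell locations to prompt an LLM and then running tests to reject refactorings that change behavior.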