🤖 AI Summary
Developers often neglect refactoring due to resource constraints and the lack of immediate returns, while existing automated tools support only a limited set of refactorings. To address this, we propose a novel paradigm that guides large language models (LLMs) using human best practices—specifically, Fowler's refactoring catalog—to perform diverse, fine-grained refactorings across 61 distinct types. Our approach integrates descriptive and rule-based instruction strategies, shifting refactoring automation from rigid pattern matching to goal-oriented, semantics-preserving transformation. We evaluate our method using models including GPT-mini and DeepSeek-V3 on both benchmark datasets and real-world GitHub projects, covering all benchmark refactoring types with high semantic fidelity. Rule-based instructions significantly outperform baselines on complex, logic-heavy refactorings. This work establishes a systematic, empirically grounded methodology for LLM-driven refactoring that is high-quality, interpretable, and broadly applicable.
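To make the two instruction styles concrete, here is a hypothetical sketch—not taken from the paper, whose actual prompts may differ—of how a descriptive instruction and a rule-based instruction for one catalog entry (Fowler's "Extract Function") could be encoded and composed into a refactoring prompt:

```python
# Hypothetical instruction templates for one refactoring type.
# The template wording and the build_prompt helper are illustrative
# assumptions, not the paper's actual prompt design.
DESCRIPTIVE = (
    "Extract Function: when a code fragment can be grouped together, "
    "move it into its own function named after its purpose, so the "
    "caller expresses intent rather than mechanism."
)

RULE_BASED = "\n".join([
    "1. Identify a cohesive fragment inside the target function.",
    "2. Create a new function named after what the fragment does.",
    "3. Copy the fragment into the new function.",
    "4. Pass variables the fragment reads as parameters; return those it writes.",
    "5. Replace the original fragment with a call to the new function.",
])

def build_prompt(style: str, code: str) -> str:
    """Compose a refactoring prompt from an instruction template and code."""
    instruction = DESCRIPTIVE if style == "descriptive" else RULE_BASED
    return f"{instruction}\n\nRefactor the following code accordingly:\n{code}"
```

A descriptive template conveys motivation in prose, while the rule-based variant spells out procedural steps—mirroring the paper's finding that the two styles trade interpretability against performance on complex cases.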
📝 Abstract
Code refactoring is a fundamental software engineering practice aimed at improving code quality and maintainability. Despite its importance, developers often neglect refactoring due to the significant time, effort, and resources it requires, as well as the lack of immediate functional rewards. Although several automated refactoring tools have been proposed, they remain limited in the breadth of refactoring types they support. In this study, we explore whether instruction strategies inspired by human best-practice guidelines can enhance the ability of Large Language Models (LLMs) to perform diverse refactoring tasks automatically. Leveraging the instruction-following and code comprehension capabilities of state-of-the-art LLMs (e.g., GPT-mini and DeepSeek-V3), we draw on Martin Fowler's refactoring guidelines to design multiple instruction strategies that encode motivations, procedural steps, and transformation objectives for 61 well-known refactoring types. We evaluate these strategies on benchmark examples and real-world code snippets from GitHub projects. Our results show that instruction designs grounded in Fowler's guidelines enable LLMs to successfully perform all benchmark refactoring types and to preserve program semantics in real-world settings, an essential criterion for effective refactoring. Moreover, while descriptive instructions are more interpretable to humans, we find that rule-based instructions often lead to better performance in specific scenarios. Interestingly, allowing models to focus on the overall goal of refactoring, rather than prescribing a fixed transformation type, can yield even greater improvements in code quality.
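The semantics-preservation criterion the abstract emphasizes can be illustrated with a minimal before/after pair for Extract Function, one of the refactorings in Fowler's catalog (the function names and data shape here are illustrative, not from the paper):

```python
# Before: billing arithmetic is tangled with output formatting.
def print_owing_before(orders):
    outstanding = sum(o["amount"] for o in orders)
    return f"Amount owed: {outstanding:.2f}"

# After "Extract Function": the calculation lives in its own
# well-named helper, and the caller handles presentation only.
def calculate_outstanding(orders):
    return sum(o["amount"] for o in orders)

def print_owing_after(orders):
    return f"Amount owed: {calculate_outstanding(orders):.2f}"

# Semantics preserved: both versions produce identical output.
orders = [{"amount": 12.5}, {"amount": 7.5}]
assert print_owing_before(orders) == print_owing_after(orders)
```

A correct refactoring, whether produced by a tool or an LLM, must keep such input/output behavior identical while improving the code's structure—which is exactly what the paper's real-world evaluation checks.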