Evaluating the Effectiveness of LLMs in Fixing Maintainability Issues in Real-World Projects

📅 2025-02-04
🤖 AI Summary
This study presents a systematic evaluation of large language models (LLMs) for repairing real-world code maintainability issues: 127 defects drawn from 10 GitHub repositories. The authors apply zero-shot prompting with Copilot Chat and Llama 3.1, and few-shot prompting with Llama 3.1 only, then assess the generated fixes through compilation checks, test execution, static maintainability analysis, and a human readability study with 45 participants. Few-shot Llama achieves the highest repair success rate at 44.9%, and 68.63% of participants rated the generated code as more readable; however, most repairs introduce compilation errors, test failures, or new maintainability defects. To capture this trade-off, the authors advocate jointly assessing correctness and maintainability, revealing that improved readability often coincides with degraded reliability. The work contributes methodology and empirical evidence for maintainability-aware code generation research.
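The multi-stage check described above (compilation, tests, then static maintainability analysis) can be sketched as a simple outcome classifier. This is a minimal illustration of the evaluation pipeline's decision order, not the authors' tooling; the predicate names are assumptions.

```python
# Hypothetical sketch of how each LLM-generated fix could be bucketed,
# following the stage order described in the summary. All names here are
# illustrative assumptions, not the paper's actual implementation.

def classify_repair(compiles: bool, tests_pass: bool, new_issues: int) -> str:
    """Return the outcome bucket for one candidate fix.

    Stages are checked in order: a fix that fails to compile is never
    run against the tests, and static analysis only matters for fixes
    that compile and pass the tests.
    """
    if not compiles:
        return "compilation_error"
    if not tests_pass:
        return "test_failure"
    if new_issues > 0:
        return "new_maintainability_issue"
    return "successful_repair"

print(classify_repair(True, True, 0))
```

A fix counts as a successful repair only if it survives every stage, which is why the reported success rates (44.9% at best) sit well below the fraction of fixes that merely compile.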

📝 Abstract
Large Language Models (LLMs) have gained attention for addressing coding problems, but their effectiveness in fixing code maintainability issues remains unclear. This study evaluates LLMs' capability to resolve 127 maintainability issues from 10 GitHub repositories. We use zero-shot prompting for Copilot Chat and Llama 3.1, and few-shot prompting with Llama only. The LLM-generated solutions are assessed for compilation errors, test failures, and new maintainability problems. Llama with few-shot prompting successfully fixed 44.9% of the methods, while Copilot Chat and Llama zero-shot fixed 32.29% and 30%, respectively. However, most solutions introduced errors or new maintainability issues. We also conducted a human study with 45 participants to evaluate the readability of 51 LLM-generated solutions, in which 68.63% of participants observed improved readability. Overall, while LLMs show potential for fixing maintainability issues, the errors they introduce highlight their current limitations.
Problem

Research questions and friction points this paper is trying to address.

LLMs' effectiveness in fixing code maintainability issues
Evaluating LLMs on real-world GitHub maintainability issues
Assessing LLM-generated solutions for errors and new defects
Innovation

Methods, ideas, or system contributions that make the work stand out.

Zero-shot prompting for LLMs
Few-shot prompting with Llama
Human study on solution readability
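The few-shot setup listed above pairs each target method with worked before/after examples in the prompt. A minimal sketch of how such a prompt could be assembled follows; the example issue, prompt wording, and helper names are illustrative assumptions, not the authors' actual prompt template.

```python
# Hypothetical sketch of few-shot prompt construction for maintainability
# repair. The example pool and phrasing are assumptions for illustration.

FEW_SHOT_EXAMPLES = [
    {
        "issue": "Long Method: method body exceeds the size threshold",
        "before": "def process(data):\n    # ...one long method...\n",
        "after": (
            "def process(data):\n"
            "    items = _parse(data)\n"
            "    return _summarize(items)\n"
        ),
    },
]

def build_prompt(issue_description: str, method_source: str) -> str:
    """Assemble a few-shot prompt: worked examples, then the target method."""
    parts = ["Fix the maintainability issue in the final method.\n"]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(
            f"Issue: {ex['issue']}\nBefore:\n{ex['before']}\nAfter:\n{ex['after']}\n"
        )
    # The target method ends with an open "After:" for the model to complete.
    parts.append(f"Issue: {issue_description}\nBefore:\n{method_source}\nAfter:\n")
    return "\n".join(parts)

prompt = build_prompt(
    "Long Method: method body exceeds the size threshold",
    "def report(rows):\n    # ...\n",
)
print(prompt.rstrip().endswith("After:"))
```

Zero-shot prompting, by contrast, would skip the example pool and send only the final instruction and target method, which the study found to be less effective (30% vs. 44.9% for Llama).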