Fixing Function-Level Code Generation Errors for Foundation Large Language Models

📅 2024-09-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large language models (LLMs) frequently generate function-level code containing diverse errors, yet the root causes remain poorly understood, and existing repair approaches rely heavily on costly re-invocations of LLMs. Method: We conduct the first large-scale empirical study of these errors, systematically categorizing 19 function-level error causes and identifying three high-frequency, rule-amenable classes: indentation errors, redundant code, and missing imports. Based on this analysis, we propose LlmFix, a lightweight, rule-driven, static-analysis-based method that performs deterministic corrections without additional LLM calls. We further introduce LlmErrorEval, an open-source evaluation framework enabling fine-grained error classification and quantitative assessment. Results: On LlmErrorEval, LlmFix achieves a 17.1% repair rate, outperforming the best LLM-based post-processing baseline by 8.9%. When applied to HumanEval and MBPP, it improves the pass@1 accuracy of 14 mainstream LLMs by an average of 7.5%.

📝 Abstract

Function-level code generation leverages foundation Large Language Models (LLMs) to automatically produce source code with expected functionality. It has been widely investigated and applied in intelligent programming assistants, such as GitHub Copilot, to enhance software development productivity. Despite advancements in foundation LLMs, the generated code still contains many errors. Existing studies leverage static analysis tools (e.g., TBar) or add another fixing LLM (e.g., LDB) to post-process these errors. However, many errors remain unsolved because their root causes have not yet been investigated, making it challenging to design better fixing tools. In this paper, we first conducted an empirical study on the generation errors. Specifically, we reproduced 14 representative LLMs on the HumanEval dataset and verified the correctness of their outputs. We obtained 12,837 code generation errors and analyzed their causes, leading to 19 categories of error causes. Our empirical analysis indicated that three of these causes can be directly fixed. Based on the findings, we proposed a fixing method called LlmFix, which addresses these three types of errors through a three-step process: filtering code for indentation correction, truncating redundant generated code, and importing missing modules. We evaluated LlmFix from two perspectives: its performance on error-fixing tasks and its impact on function-level code generation tasks. For error-fixing performance, we built an evaluation dataset, LlmErrorEval. Experimental results show that LlmFix achieves a fix rate of 17.1%, outperforming the best baseline, LDB, by 8.9%. For code generation improvements, evaluations of LlmFix on both the HumanEval and MBPP datasets demonstrate its effectiveness, improving code generation accuracy by an average of 7.5% across 14 LLMs.
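
To make the three-step process concrete, here is a minimal Python sketch of what rule-based post-processing of this kind could look like. It is not the authors' LlmFix implementation: the function names, the `KNOWN_MODULES` whitelist, and the specific rules (uniform dedent, cut at the first stray top-level statement, prepend imports for unbound module names) are all assumptions chosen for illustration.

```python
import ast
import textwrap

# Hypothetical whitelist of modules this sketch is willing to import.
KNOWN_MODULES = {"math", "re", "collections", "itertools", "heapq", "string"}


def fix_indentation(code: str) -> str:
    """Step 1 (assumed rule): expand tabs and strip a uniform leading indent."""
    return textwrap.dedent(code.expandtabs(4))


def truncate_redundant(code: str) -> str:
    """Step 2 (assumed rule): once a function has been seen, cut at the first
    top-level line that is not a def, class, import, or decorator."""
    allowed = ("def ", "class ", "import ", "from ", "@")
    kept, seen_def = [], False
    for line in code.splitlines():
        top_level = bool(line.strip()) and not line[0].isspace()
        if top_level and line.startswith("def "):
            seen_def = True
        elif top_level and seen_def and not line.startswith(allowed):
            break  # trailing tests, prose, or example calls start here
        kept.append(line)
    return "\n".join(kept).rstrip() + "\n"


def add_missing_imports(code: str) -> str:
    """Step 3 (assumed rule): prepend `import X` for whitelisted module names
    that are read but never imported, assigned, or used as a parameter."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return code  # leave unparsable code untouched
    bound, used = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            bound.update((a.asname or a.name).split(".")[0] for a in node.names)
        elif isinstance(node, ast.Name):
            (used if isinstance(node.ctx, ast.Load) else bound).add(node.id)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
    missing = sorted((used & KNOWN_MODULES) - bound)
    return "".join(f"import {m}\n" for m in missing) + code


def rule_based_fix(code: str) -> str:
    """Apply the three steps in the order the abstract lists them."""
    return add_missing_imports(truncate_redundant(fix_indentation(code)))
```

Given a generation that is uniformly over-indented, calls `math.sqrt` without importing `math`, and ends with a stray `print(...)` call, `rule_based_fix` would return a dedented function preceded by `import math` and without the trailing call. No LLM is invoked at any step, which is the property the abstract emphasizes.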

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

Code Generation

Accuracy Improvement

Innovation

Methods, ideas, or system contributions that make the work stand out.

LlmFix

Code Generation Correction

Large Language Model Optimization

🔎 Similar Papers

From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging