🤖 AI Summary
Problem: Large language models (LLMs) frequently generate function-level code containing diverse errors, yet the root causes remain poorly understood, and existing repair approaches rely heavily on costly re-invocations of LLMs. Method: We conduct the first large-scale empirical study to systematically categorize 19 causes of function-level generation errors and identify three high-frequency, rule-amenable classes: indentation errors, redundant code, and missing imports. Based on this analysis, we propose LlmFix, a lightweight, rule-driven, static-analysis-based method that performs deterministic corrections without additional LLM calls. We further introduce LlmErrorEval, an open-source evaluation framework enabling fine-grained error classification and quantitative assessment. Results: On LlmErrorEval, LlmFix achieves a 17.1% repair rate, outperforming the best LLM-based post-processing baseline by 8.9%. When applied to HumanEval and MBPP, it improves the pass@1 accuracy of 14 mainstream LLMs by an average of 7.5%.
📝 Abstract
Function-level code generation uses foundation large language models (LLMs) to automatically produce source code with the expected functionality. It has been widely investigated and applied in intelligent programming assistants, such as GitHub Copilot, to enhance software development productivity. Despite advances in foundation LLMs, the generated code still contains many errors. Existing studies use static analysis tools (e.g., TBar) or add a separate fixing LLM (e.g., LDB) to post-process these errors. However, many errors remain unresolved because their root causes have not yet been investigated, which makes it difficult to design better fixing tools. In this paper, we first conduct an empirical study of these generation errors. Specifically, we reproduced the results of 14 representative LLMs on the HumanEval dataset and verified the correctness of the generated code. We obtained 12,837 code generation errors and analyzed their causes, which yielded 19 categories of error causes. Our empirical analysis indicates that three of these causes can be fixed directly. Based on these findings, we propose a fixing method called LlmFix, which addresses these three types of errors through a three-step process: filtering code for indentation correction, truncating redundant generated code, and importing missing modules. We evaluate LlmFix from two perspectives: its performance on error-fixing tasks and its impact on function-level code generation tasks. For error-fixing performance, we built an evaluation dataset, LlmErrorEval. Experimental results show that LlmFix achieves a fix rate of 17.1%, outperforming the best baseline, LDB, by 8.9%. For code generation, evaluations on both the HumanEval and MBPP datasets show that LlmFix improves code generation accuracy by an average of 7.5% across 14 LLMs.
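To make the three-step process concrete, the following is a minimal Python sketch of what a rule-based post-processor of this kind could look like. It is not the authors' LlmFix implementation: the function names, the `KNOWN_MODULES` allow-list, and the specific heuristics (dedenting, dropping trailing lines until the code parses, keeping only imports and definitions) are assumptions chosen for illustration. It requires Python 3.9+ for `ast.unparse`.

```python
import ast
import textwrap

# Hypothetical allow-list of modules that may be auto-imported (an assumption).
KNOWN_MODULES = {"math", "re", "collections", "itertools", "heapq"}


def fix_indentation(code: str) -> str:
    """Step 1 (sketch): remove indentation shared by all lines."""
    return textwrap.dedent(code).strip("\n") + "\n"


def truncate_redundant(code: str) -> str:
    """Step 2 (sketch): drop trailing junk and non-definition statements."""
    lines = code.splitlines()
    # Pop trailing lines until the remaining prefix parses.
    while lines:
        try:
            tree = ast.parse("\n".join(lines))
            break
        except SyntaxError:
            lines.pop()
    else:
        return code  # nothing parseable; leave the input untouched
    keep_types = (ast.Import, ast.ImportFrom, ast.FunctionDef, ast.ClassDef)
    kept = [node for node in tree.body if isinstance(node, keep_types)]
    # Note: ast.unparse discards comments and original formatting.
    return "\n\n".join(ast.unparse(node) for node in kept) + "\n"


def add_missing_imports(code: str) -> str:
    """Step 3 (sketch): prepend imports for known modules used but not imported."""
    tree = ast.parse(code)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported.update((a.asname or a.name).split(".")[0] for a in node.names)
        elif isinstance(node, ast.ImportFrom):
            imported.update(a.asname or a.name for a in node.names)
    used = {
        n.id
        for n in ast.walk(tree)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)
    }
    # Caveat: a local variable that shadows a module name would be misdetected.
    missing = sorted((used & KNOWN_MODULES) - imported)
    return "".join(f"import {m}\n" for m in missing) + code


def llmfix_sketch(code: str) -> str:
    """Apply the three illustrative steps in order."""
    return add_missing_imports(truncate_redundant(fix_indentation(code)))


if __name__ == "__main__":
    raw = (
        "    def area(r):\n"
        "        return math.pi * r ** 2\n"
        "\n"
        "    print(area(2))\n"
        "    assert area(1) ==\n"
    )
    print(llmfix_sketch(raw))
```

In this example the generated snippet is over-indented, ends with a cut-off assertion and a stray `print` call, and uses `math` without importing it. Each step handles one of those problems deterministically, with no further model call, which is the property the abstract attributes to this style of repair.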