Towards Understanding the Characteristics of Code Generation Errors Made by Large Language Models

📅 2024-06-13

📈 Citations: 5

✨ Influential: 0

🤖 AI Summary

Problem: The mechanisms behind code generation errors in large language models (LLMs) remain poorly understood. Method: Using the HumanEval benchmark, this work systematically analyzes errors produced by six representative LLMs and introduces a fine-grained error taxonomy that covers both semantic and syntactic dimensions. The taxonomy is derived through open coding and thematic analysis, supported by statistical testing and qualitative root-cause attribution, and it identifies more than ten recurrent error patterns, including logical flaws, boundary condition failures, and API misuse. Contribution/Results: The analysis shows that LLM errors are typically non-trivial, span multiple lines, and occur in varied locations, and that latent errors appear even in tasks with high pass rates. It also reports a nonlinear positive correlation between error frequency and task complexity. The taxonomy offers an interpretable, extensible basis for error localization, diagnosis, and repair in LLM-generated code.

📝 Abstract

Large Language Models (LLMs) have demonstrated unprecedented capabilities in code generation. However, there remains a limited understanding of the code generation errors that LLMs can produce. To bridge this gap, we conducted an in-depth analysis of code generation errors across six representative LLMs on the HumanEval dataset. Specifically, we employed open coding and thematic analysis to distill a comprehensive taxonomy of code generation errors. We analyzed two dimensions of error characteristics: semantic and syntactic. Our analysis revealed that LLMs often made non-trivial, multi-line code generation errors in various locations and with various root causes. We further analyzed the correlation between these errors and task complexity as well as test pass rate. Our findings highlighted several challenges in locating and fixing code generation errors made by LLMs. Finally, we discussed several future directions to address these challenges.

Problem

Research questions and friction points this paper is trying to address.

Analyze code generation errors by LLMs

Classify semantic and syntactic error characteristics

Explore error correlation with task complexity

Innovation

Methods, ideas, or system contributions that make the work stand out.

Open coding for error taxonomy

Thematic analysis of error characteristics

Correlation of errors with task complexity
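
To make the correlation step more concrete, here is a minimal sketch of how error counts could be related to task complexity with a rank correlation. It is an illustration under stated assumptions, not the authors' procedure: the summary does not say which statistic or complexity measure the paper uses, so Spearman's rank correlation, the helper names `average_ranks` and `spearman`, and the toy records are all hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical placeholder records, NOT data from the paper:
# (complexity proxy for a task, number of erroneous generations for that task)
toy_tasks = [(3, 0), (5, 1), (8, 1), (12, 3), (20, 4)]
complexity = [c for c, _ in toy_tasks]
errors = [e for _, e in toy_tasks]
rho = spearman(complexity, errors)
print(f"Spearman rho on toy data: {rho:.3f}")
```

A rank-based statistic is chosen here only because it does not assume a linear relationship; any real analysis would substitute the paper's own complexity measure and error labels.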

🔎 Similar Papers

Fixing Function-Level Code Generation Errors for Foundation Large Language Models