🤖 AI Summary
Large language models (LLMs) frequently generate code hallucinations, plausible yet incorrect outputs that are difficult to detect and rectify and that compromise code reliability. Method: We propose the first systematic taxonomy of code hallucinations tailored to CodeLLMs, identifying five canonical hallucination patterns; construct a multidimensional benchmark that integrates HumanEval, an extended MBPP dataset, and execution-based metrics; and empirically evaluate hallucination rates across 12 state-of-the-art CodeLLMs, uncovering critical risks such as execution-path-dependent hallucinations. We further introduce three scalable mitigation paradigms (error-pattern mining, adversarial validation, and interpretability-guided analysis) and validate their efficacy. Contribution/Results: Our work systematically characterizes the limitations of existing approaches, establishes a principled technical pathway spanning hallucination detection, localization, and elimination, and provides both a theoretical framework and practical guidelines for trustworthy code generation.
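To make "execution-path-dependent hallucinations" concrete, here is a minimal, hypothetical Python sketch (the function and inputs are illustrative, not drawn from the surveyed work): generated code that passes the obvious tests yet hides a fault that only a rarely exercised branch triggers.

```python
# Hypothetical illustration of an execution-path-dependent hallucination.
def days_in_month(year: int, month: int) -> int:
    """Plausible LLM-style output: correct on common inputs."""
    days = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
    if month == 2 and year % 4 == 0:
        # Hallucinated leap-year rule: misses the century exception
        # (years divisible by 100 but not by 400 are not leap years).
        return 29
    return days[month - 1]

# The obvious tests pass, so the fault goes unnoticed:
assert days_in_month(2024, 2) == 29
assert days_in_month(2023, 2) == 28

# The bug only surfaces on a rarely exercised execution path:
print(days_in_month(1900, 2))  # prints 29, but February 1900 had 28 days
```

Detecting this kind of defect requires exercising the faulty path, which is why execution-based evaluation features prominently in the benchmarks the survey reviews.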
📝 Abstract
Recent technical breakthroughs in large language models (LLMs) have enabled them to fluently generate source code. Software developers often leverage both general-purpose and code-specialized LLMs to revise existing code or even generate whole functions from scratch. These capabilities also benefit no-code and low-code contexts, where users without a technical background can write programs. However, due to their internal design, LLMs are prone to generating hallucinations: outputs that are incorrect, nonsensical, or unjustifiable, and whose presence is difficult to detect. This problem extends to source code generation. Once hallucinated code is produced, users often struggle to identify and fix it, especially when the hallucination only manifests under specific execution paths. As a result, hallucinated code may remain unnoticed within the codebase. This survey investigates recent studies and techniques relevant to hallucinations generated by CodeLLMs. We categorize the types of hallucinations in code generated by CodeLLMs, review existing benchmarks and mitigation strategies, and identify open challenges. Based on these findings, we outline further research directions for detecting and removing hallucinations produced by CodeLLMs.
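The execution-based metrics referenced above typically report pass@k: the probability that at least one of k sampled generations passes all unit tests for a problem. As a reference point, the sketch below implements the standard unbiased estimator popularized with HumanEval (Chen et al., 2021); the helper name pass_at_k is our own.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (HumanEval, Chen et al., 2021).

    n: total generations sampled per problem
    c: generations that pass all unit tests
    k: sample budget being scored
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# e.g. 200 samples per problem, 37 of which pass, scored at k=10:
print(pass_at_k(n=200, c=37, k=10))
```

The estimator is averaged over all benchmark problems; because it executes generated code against unit tests, it catches functional hallucinations that purely textual similarity metrics miss.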