CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification

📅 2024-04-30
📈 Citations: 2
Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) frequently generate code exhibiting "code hallucination": outputs that are syntactically plausible yet semantically flawed, involving incorrect semantic mappings, improper naming, resource management errors, or logical inconsistencies that lead to functional failure. Method: This work introduces the first systematic definition and fine-grained taxonomy of code hallucination across four categories, proposes an execution-driven dynamic detection paradigm realized in the lightweight detection algorithm CodeHalu, and develops CodeHaluEval, the first open-source benchmark designed specifically for quantitative evaluation of code hallucination, comprising 8,883 high-quality samples. Contribution/Results: Evaluating 17 state-of-the-art LLMs across 699 programming tasks reveals substantial disparities in their code reliability. All tools, benchmarks, and evaluation protocols are publicly released, establishing foundational infrastructure and standardized metrics for trustworthy code generation research.

📝 Abstract
Large Language Models (LLMs) have made significant progress in code generation, offering developers groundbreaking automated programming support. However, LLMs often generate code that is syntactically correct and even semantically plausible, but may not execute as expected or fulfill specified requirements. This phenomenon of hallucinations in the code domain has not been systematically explored. To advance the community's understanding and research on this issue, we introduce the concept of code hallucinations and propose a classification method for code hallucination based on execution verification. We categorize code hallucinations into four main types: mapping, naming, resource, and logic hallucinations, with each category further divided into different subcategories to understand and address the unique challenges faced by LLMs in code generation with finer granularity. Additionally, we present a dynamic detection algorithm called CodeHalu designed to detect and quantify code hallucinations. We also introduce the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks, to systematically and quantitatively evaluate code hallucinations. By evaluating 17 popular LLMs using this benchmark, we reveal significant differences in their accuracy and reliability in code generation, offering detailed insights for further improving the code generation capabilities of LLMs. The CodeHalu benchmark and code are publicly available at https://github.com/yuchen814/CodeHalu.
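The execution-based classification described above can be sketched minimally: run the generated code and its verification test, then map the observed failure mode onto the paper's four categories (mapping, naming, resource, logic). This is an illustrative sketch, not the authors' implementation; the function name `classify_hallucination` and the exception-to-category mapping are assumptions for demonstration.

```python
# Illustrative sketch (not the paper's actual algorithm): classify an LLM-generated
# Python snippet by executing it and inspecting how it fails.
# The mapping from exception types to hallucination categories is an assumption.

def classify_hallucination(code: str, test: str) -> str:
    """Execute `code` then `test`; return a hallucination category or 'pass'."""
    env: dict = {}
    try:
        exec(code, env)   # run the generated solution
        exec(test, env)   # run the verification assertions
    except (NameError, AttributeError, ImportError):
        return "naming"    # undefined or wrongly named identifiers/APIs
    except (TypeError, ValueError, KeyError, IndexError):
        return "mapping"   # data or operations mapped to the wrong semantics
    except (MemoryError, RecursionError, OSError):
        return "resource"  # resource management failures
    except AssertionError:
        return "logic"     # executes, but output contradicts the specification
    return "pass"

# Example: a naming hallucination (the solution calls a nonexistent function)
buggy = "def add(a, b):\n    return summ(a, b)"
print(classify_hallucination(buggy, "assert add(1, 2) == 3"))  # → naming
```

In practice such execution would happen inside a sandbox with time and memory limits; the bare `exec` here only illustrates the verification loop.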
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Code Hallucination
Software Errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

CodeHalu
Error Detection
Code Hallucinations
Yuchen Tian
HKBU
Code Intelligence
Weixiang Yan
Amazon
Code Intelligence · Agentic RL · Software Automation
Qian Yang
Mila - Quebec AI Institute, Université de Montréal
Xuandong Zhao
UC Berkeley
Machine Learning · Natural Language Processing · AI Safety
Qian Chen
Alibaba Group
Wen Wang
Alibaba Group
Ziyang Luo
Salesforce AI Research
Agents · LLMs · Multimodal
Lei Ma
The University of Tokyo, University of Alberta