🤖 AI Summary
This study is the first to systematically measure the prevalence of hallucinations when large language models (LLMs) generate natural language from code changes, specifically in commit message and code review comment generation, finding hallucinations in approximately 50% of generated review comments and 20% of generated commit messages.
Method: To address this, we propose the first multi-metric hallucination detection framework tailored to code-change scenarios, jointly modeling model confidence, gradient-based feature attribution, and semantic similarity to enable efficient, fine-tuning-free detection at inference time.
Contribution/Results: Extensive experiments show that our method substantially outperforms single-metric baselines across diverse LLMs and code-change datasets. This provides a practical, empirically validated path to more factually reliable and trustworthy code-related NLG systems, and establishes foundational evidence for hallucination mitigation in software engineering AI applications.
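To make the multi-metric idea concrete, here is a minimal sketch, assuming each generated message already has three per-sample scores (mean token log-probability as confidence, a feature-attribution statistic, and a semantic-similarity score against the code change). The synthetic data, feature weights, and logistic-regression combiner are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch: combining three hallucination signals with a lightweight
# classifier, so no fine-tuning of the underlying LLM is required.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
# Columns: [confidence, attribution, semantic_similarity]; label 1 = hallucinated.
# Synthetic data stands in for real per-sample metric scores.
X = rng.normal(size=(n, 3))
y = (X @ np.array([-1.2, 0.8, -1.0]) + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Single-metric baseline: flag low-confidence generations alone.
auc_single = roc_auc_score(y_te, -X_te[:, 0])

# Multi-metric detector: logistic regression over all three signals.
clf = LogisticRegression().fit(X_tr, y_tr)
auc_multi = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"single-metric AUC: {auc_single:.3f}  multi-metric AUC: {auc_multi:.3f}")
```

On this toy data the combined detector should score a higher AUC than thresholding confidence alone, mirroring the paper's finding that individual metrics are weak detectors but combinations improve performance.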
📝 Abstract
Language models have shown strong capabilities across a wide range of software engineering tasks, such as code generation, yet they suffer from hallucinations. While hallucinations have been studied independently in natural language and code generation, their occurrence in tasks involving code changes, which have a structurally complex and context-dependent format, remains largely unexplored. This paper presents the first comprehensive analysis of hallucinations in two critical code-change-to-natural-language generation tasks: commit message generation and code review comment generation. We quantify the prevalence of hallucinations in recent language models and explore a range of metric-based approaches to automatically detect them. Our findings reveal that approximately 50% of generated code reviews and 20% of generated commit messages contain hallucinations. While commonly used metrics are weak detectors on their own, combining multiple metrics substantially improves performance. Notably, model confidence and feature attribution metrics contribute effectively to hallucination detection, showing promise for inference-time detection. (All code and data will be released upon acceptance.)
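As a hedged illustration of one inference-time signal the abstract highlights, the snippet below computes model confidence as the mean log-probability of the generated tokens using Hugging Face transformers. The gpt2 checkpoint and the diff-style prompt are placeholders; the paper's exact models, prompts, and scoring may differ.

```python
# Sketch: model confidence as mean token log-probability of a generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Placeholder code-change prompt; a real setup would feed the actual diff.
prompt = "Commit message for this diff:\n- old_value\n+ new_value\n"
inputs = tok(prompt, return_tensors="pt")

out = model.generate(
    **inputs, max_new_tokens=20, do_sample=False,
    return_dict_in_generate=True, output_scores=True,
)

# Log-probabilities of each generated token under the model.
scores = model.compute_transition_scores(
    out.sequences, out.scores, normalize_logits=True
)
confidence = scores[0].mean().item()  # low values may flag hallucination
print(f"mean token log-prob: {confidence:.3f}")
```

Such a score requires no extra model passes beyond generation itself, which is what makes confidence-style metrics attractive for inference-time detection.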