🤖 AI Summary
This study is the first to systematically measure the prevalence of hallucinations when large language models (LLMs) generate natural language from code changes, specifically in commit message and code review comment generation, finding hallucinations in approximately 50% of generated review comments and 20% of generated commit messages.
Method: To address this, we propose the first multi-metric hallucination detection framework tailored to code-change scenarios, jointly modeling model confidence, gradient-based feature attribution, and semantic similarity to enable efficient, fine-tuning-free detection at inference time.
Contribution/Results: Extensive experiments show that our method substantially outperforms single-metric baselines across diverse LLMs and code-change datasets. This provides a practical, empirically validated path to more factually reliable and trustworthy code-related NLG systems, and establishes foundational evidence for hallucination mitigation in software engineering AI applications.
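To make the multi-metric idea concrete, here is a minimal sketch, assuming each generated message already has three per-sample scores (mean token log-probability as confidence, a feature-attribution statistic, and a semantic-similarity score against the code change). The synthetic data, feature weights, and logistic-regression combiner are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch: combining three hallucination signals with a lightweight
# classifier, so no fine-tuning of the underlying LLM is required.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
# Columns: [confidence, attribution, semantic_similarity]; label 1 = hallucinated.
# Synthetic data stands in for real per-sample metric scores.
X = rng.normal(size=(n, 3))
y = (X @ np.array([-1.2, 0.8, -1.0]) + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Single-metric baseline: flag low-confidence generations alone.
auc_single = roc_auc_score(y_te, -X_te[:, 0])

# Multi-metric detector: logistic regression over all three signals.
clf = LogisticRegression().fit(X_tr, y_tr)
auc_multi = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

print(f"single-metric AUC: {auc_single:.3f}  multi-metric AUC: {auc_multi:.3f}")
```

On this toy data the combined detector should score a higher AUC than thresholding confidence alone, mirroring the paper's finding that individual metrics are weak detectors but combinations improve performance.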
📝 Abstract
Language models have shown strong capabilities across a wide range of software engineering tasks, such as code generation, yet they suffer from hallucinations. While hallucinations have been studied independently in natural language and code generation, their occurrence in tasks involving code changes, which have a structurally complex and context-dependent format, remains largely unexplored. This paper presents the first comprehensive analysis of hallucinations in two critical code-change-to-natural-language generation tasks: commit message generation and code review comment generation. We quantify the prevalence of hallucinations in recent language models and explore a range of metric-based approaches to automatically detect them. Our findings reveal that approximately 50% of generated code reviews and 20% of generated commit messages contain hallucinations. While commonly used metrics are weak detectors on their own, combining multiple metrics substantially improves performance. Notably, model confidence and feature attribution metrics contribute effectively to hallucination detection, showing promise for inference-time detection. (All code and data will be released upon acceptance.)
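As a hedged illustration of one inference-time signal the abstract highlights, the snippet below computes model confidence as the mean log-probability of the generated tokens using Hugging Face transformers. The gpt2 checkpoint and the diff-style prompt are placeholders; the paper's exact models, prompts, and scoring may differ.

```python
# Sketch: model confidence as mean token log-probability of a generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Placeholder code-change prompt; a real setup would feed the actual diff.
prompt = "Commit message for this diff:\n- old_value\n+ new_value\n"
inputs = tok(prompt, return_tensors="pt")

out = model.generate(
    **inputs, max_new_tokens=20, do_sample=False,
    return_dict_in_generate=True, output_scores=True,
)

# Log-probabilities of each generated token under the model.
scores = model.compute_transition_scores(
    out.sequences, out.scores, normalize_logits=True
)
confidence = scores[0].mean().item()  # low values may flag hallucination
print(f"mean token log-prob: {confidence:.3f}")
```

Such a score requires no extra model passes beyond generation itself, which is what makes confidence-style metrics attractive for inference-time detection.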