CodeFuse-CommitEval: Towards Benchmarking LLM's Power on Commit Message and Code Change Inconsistency Detection

📅 2025-11-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper addresses the problem of message-code inconsistency (MCI), in which commit messages misrepresent the actual code changes, leading to flawed code reviews, degraded maintainability, contamination of empirical software engineering datasets, and obfuscation of security patches. To tackle this, the authors introduce CodeFuse-CommitEval, the first benchmark dedicated to MCI detection. Methodologically, they propose seven rule-guided strategies for generating diverse inconsistent samples and ensure high data quality via two-fold validation combining human-in-the-loop review and LLM-based checks. Building on the ApacheCM dataset, they evaluate six open-source large language models (LLMs) under a vanilla setting and with three augmentation techniques: few-shot prompting, chain-of-thought reasoning, and extended context windows. Experimental results show that models detect inconsistent commits more reliably than consistent ones (average recall 85.95%, precision 80.28%), with gpt-oss-20B performing best overall. The analysis further reveals substantial variation in detection difficulty and contextual dependency across MCI types.

๐Ÿ“ Abstract
Version control relies on commit messages to convey the rationale for code changes, but these messages are often low quality and, more critically, inconsistent with their diffs, a problem known as message-code inconsistency (MCI). MCIs mislead reviewers, hinder maintenance, contaminate research datasets, and may obscure security patches. Yet, no dedicated benchmark exists to evaluate models for MCI detection. We introduce CodeFuse-CommitEval, the first benchmark designed for MCI detection using large language models (LLMs). Built on the ApacheCM dataset for diversity and quality, we generate seven types of inconsistent messages through rule-guided mutations of originally consistent commits and apply two-fold validation to verify both positive and negative samples. Using this labeled dataset of message-diff pairs, we evaluate six state-of-the-art open-source LLMs under a vanilla setting and with three augmentation strategies: few-shot prompting, chain-of-thought, and extended context. Results show models detect inconsistent commits more reliably than consistent ones (average Recall 85.95%, Precision 80.28%, Specificity 63.8%); gpt-oss-20B performs best overall but uses over twice the tokens of others. Augmentation effects vary: adjacent context helps larger models but adds noise for smaller ones; few-shot improves accuracy and reduces token use, yet increases universally incorrect predictions; chain-of-thought boosts precision and specificity at the cost of recall and higher token consumption. Type-wise analysis reveals higher detectability for component, file-path, and operation inconsistencies, but lower accuracy and higher token cost for intent-level "purpose" inconsistencies. CodeFuse-CommitEval provides a rigorous foundation for measuring, comparing, and advancing MCI detection, highlighting the need for richer context and balanced data to capture high-level semantic gaps.
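The headline numbers in the abstract (recall, precision, specificity) follow the standard binary-classification definitions with "inconsistent" as the positive class. A minimal sketch of how they relate to confusion counts (illustrative only; not code from the paper):

```python
# Metrics for a binary MCI detector: 1 = inconsistent (positive), 0 = consistent.
def mci_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0       # hit rate on inconsistent commits
    precision = tp / (tp + fp) if tp + fp else 0.0    # reliability of "inconsistent" flags
    specificity = tn / (tn + fp) if tn + fp else 0.0  # hit rate on consistent commits
    return recall, precision, specificity
```

The gap the paper reports (high recall, lower specificity) means models flag inconsistent commits readily but also mislabel a sizable share of genuinely consistent ones.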
Problem

Research questions and friction points this paper is trying to address.

Detecting inconsistencies between commit messages and code changes
Benchmarking large language models for message-code inconsistency detection
Evaluating model performance across different types of commit inconsistencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

First benchmark for commit message inconsistency detection
Rule-guided mutation generates seven inconsistency types
Evaluates LLMs with three augmentation strategies
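To make the rule-guided mutation idea concrete: one of the seven inconsistency types, a file-path inconsistency, could be produced by swapping a path mentioned in a consistent commit message for a path that does not appear in the diff. This is a hypothetical sketch, not the paper's implementation; all names below are illustrative:

```python
# Hypothetical rule-guided mutation for a "file-path inconsistency" sample:
# replace the first diff path found in the message with a decoy path absent
# from the diff, turning a consistent message into a labeled-inconsistent one.
def mutate_file_path(message, diff_paths, decoy_path):
    for path in diff_paths:
        if path in message:
            return message.replace(path, decoy_path, 1), True  # now inconsistent
    return message, False  # rule not applicable: message names no changed file

msg = "fix: handle null config in core/parser.py"
mutated, changed = mutate_file_path(msg, ["core/parser.py"], "utils/logger.py")
```

In the benchmark, samples generated this way are then checked by two-fold validation (human-in-the-loop plus LLM-based) before entering the dataset.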