CodeFuse-CommitEval: Towards Benchmarking LLM's Power on Commit Message and Code Change Inconsistency Detection

📅 2025-11-24
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This paper addresses the problem of message-code inconsistency (MCI), in which commit messages misrepresent the actual code changes, leading to flawed code reviews, degraded maintainability, contamination of empirical software engineering datasets, and obfuscation of security patches. To tackle this, the authors introduce CodeFuse-CommitEval, the first benchmark dedicated to MCI detection. Methodologically, they propose seven rule-guided strategies for generating diverse inconsistent samples and ensure high data quality via two-fold validation combining human-in-the-loop review and LLM-based checks. Building on the ApacheCM dataset, they evaluate six open-source large language models (LLMs) under a vanilla setting and with three augmentation techniques: few-shot prompting, chain-of-thought reasoning, and extended context windows. Experimental results show that models detect inconsistent commits more reliably than consistent ones (average recall 85.95%, precision 80.28%), with gpt-oss-20B performing best overall. The analysis further reveals substantial variation in detection difficulty and contextual dependency across MCI types.

๐Ÿ“ Abstract
Version control relies on commit messages to convey the rationale for code changes, but these messages are often low quality and, more critically, inconsistent with their diffs, a problem known as message-code inconsistency (MCI). MCIs mislead reviewers, hinder maintenance, contaminate research datasets, and may obscure security patches. Yet, no dedicated benchmark exists to evaluate models for MCI detection. We introduce CodeFuse-CommitEval, the first benchmark designed for MCI detection using large language models (LLMs). Built on the ApacheCM dataset for diversity and quality, we generate seven types of inconsistent messages through rule-guided mutations of originally consistent commits and apply two-fold validation to verify both positive and negative samples. Using this labeled dataset of message-diff pairs, we evaluate six state-of-the-art open-source LLMs under a vanilla setting and with three augmentation strategies: few-shot prompting, chain-of-thought, and extended context. Results show models detect inconsistent commits more reliably than consistent ones (average Recall 85.95%, Precision 80.28%, Specificity 63.8%); gpt-oss-20B performs best overall but uses over twice the tokens of others. Augmentation effects vary: adjacent context helps larger models but adds noise for smaller ones; few-shot improves accuracy and reduces token use, yet increases universally incorrect predictions; chain-of-thought boosts precision and specificity at the cost of recall and higher token consumption. Type-wise analysis reveals higher detectability for component, file-path, and operation inconsistencies, but lower accuracy and higher token cost for intent-level "purpose" inconsistencies. CodeFuse-CommitEval provides a rigorous foundation for measuring, comparing, and advancing MCI detection, highlighting the need for richer context and balanced data to capture high-level semantic gaps.
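The headline numbers in the abstract (recall, precision, specificity) follow the standard binary-classification definitions with "inconsistent" as the positive class. A minimal sketch of how they relate to confusion counts (illustrative only; not code from the paper):

```python
# Metrics for a binary MCI detector: 1 = inconsistent (positive), 0 = consistent.
def mci_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0       # hit rate on inconsistent commits
    precision = tp / (tp + fp) if tp + fp else 0.0    # reliability of "inconsistent" flags
    specificity = tn / (tn + fp) if tn + fp else 0.0  # hit rate on consistent commits
    return recall, precision, specificity
```

The gap the paper reports (high recall, lower specificity) means models flag inconsistent commits readily but also mislabel a sizable share of genuinely consistent ones.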
Problem

Research questions and friction points this paper is trying to address.

Detecting inconsistencies between commit messages and code changes
Benchmarking large language models for message-code inconsistency detection
Evaluating model performance across different types of commit inconsistencies
Innovation

Methods, ideas, or system contributions that make the work stand out.

First benchmark for commit message inconsistency detection
Rule-guided mutation generates seven inconsistency types
Evaluates LLMs with three augmentation strategies
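To make the rule-guided mutation idea concrete: one of the seven inconsistency types, a file-path inconsistency, could be produced by swapping a path mentioned in a consistent commit message for a path that does not appear in the diff. This is a hypothetical sketch, not the paper's implementation; all names below are illustrative:

```python
# Hypothetical rule-guided mutation for a "file-path inconsistency" sample:
# replace the first diff path found in the message with a decoy path absent
# from the diff, turning a consistent message into a labeled-inconsistent one.
def mutate_file_path(message, diff_paths, decoy_path):
    for path in diff_paths:
        if path in message:
            return message.replace(path, decoy_path, 1), True  # now inconsistent
    return message, False  # rule not applicable: message names no changed file

msg = "fix: handle null config in core/parser.py"
mutated, changed = mutate_file_path(msg, ["core/parser.py"], "utils/logger.py")
```

In the benchmark, samples generated this way are then checked by two-fold validation (human-in-the-loop plus LLM-based) before entering the dataset.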