Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings

📅 2024-10-15
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the misalignment between online deployment and offline evaluation of Commit Message Generation (CMG) systems at JetBrains. The authors propose selecting offline metrics based on an online metric grounded in users' actual editing effort: the number of edits users introduce before committing a generated message. They build a markup collection tool that mimics the real CMG workflow, collect 57 pairs of GPT-4-generated commit messages and their human-edited counterparts, and synthetically extend this to a final dataset of 656 pairs. Correlating widely used similarity metrics against the online metric, they find that edit distance correlates best, while BLEU and METEOR correlate poorly, contradicting earlier studies and suggesting that real-world user interactions with a CMG system differ significantly from judgments by human labelers in controlled settings. The approach bridges online user behavior and offline metric-based evaluation, enabling fast system iteration without user experiments. All code and data are publicly released.

πŸ“ Abstract
When a Commit Message Generation (CMG) system is integrated into the IDEs and other products at JetBrains, we perform online evaluation based on user acceptance of the generated messages. However, performing online experiments with every change to a CMG system is troublesome, as each iteration affects users and requires time to collect enough statistics. On the other hand, offline evaluation, a prevalent approach in the research literature, facilitates fast experiments but employs automatic metrics that are not guaranteed to represent the preferences of real users. In this work, we describe a novel way we employed to deal with this problem at JetBrains, by leveraging an online metric - the number of edits users introduce before committing the generated messages to the VCS - to select metrics for offline experiments. To support this new type of evaluation, we develop a novel markup collection tool mimicking the real workflow with a CMG system, collect a dataset with 57 pairs consisting of commit messages generated by GPT-4 and their counterparts edited by human experts, and design and verify a way to synthetically extend such a dataset. Then, we use the final dataset of 656 pairs to study how the widely used similarity metrics correlate with the online metric reflecting the real users' experience. Our results indicate that edit distance exhibits the highest correlation with the online metric, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation. This contradicts the previous studies on similarity metrics for CMG, suggesting that user interactions with a CMG system in real-world settings differ significantly from the responses by human labelers within controlled environments. We release all the code and the dataset to support future research in the field: https://jb.gg/cmg-evaluation.
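The core idea above can be sketched in a few lines: the online metric is the amount of editing a user performs before committing, and its offline proxy is the edit distance between a generated message and its human-edited counterpart. The snippet below is an illustrative sketch, not the paper's released code; the example message pair is hypothetical, and word-level Levenshtein distance is one common choice of edit distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Word-level edit distance between two commit messages."""
    xs, ys = a.split(), b.split()
    prev = list(range(len(ys) + 1))
    for i, x in enumerate(xs, 1):
        cur = [i]
        for j, y in enumerate(ys, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical pair: a GPT-4 draft vs. the message the user actually committed.
generated = "Fix bug in parser"
edited = "Fix off-by-one bug in the config parser"
print(levenshtein(generated, edited))  # → 3
```

In the paper's setup, such per-pair distances computed over the 656-pair dataset are then correlated with the online metric (the number of user edits) to decide which offline metric best reflects real user experience.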
Problem

Research questions and friction points this paper is trying to address.

Code Commit Messages
AI System Evaluation
Real-world User Scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online User-Edit Feedback
Offline Metric Selection
Edit Distance