🤖 AI Summary
This work addresses key limitations in research on commit message generation: the absence of large-scale benchmark datasets, semantic annotations, and reliable reference-free evaluation methods. The authors introduce CommitSuite, a large-scale multilingual benchmark of 63,533 Conventional Commits Specification (CCS)-compliant commits drawn from 243 open-source repositories across seven programming languages. Each commit is labeled with its CCS type and enriched with abstract syntax tree (AST)-level code change representations and LLM-assisted semantic annotations capturing the "what" and "why" of the change. They further propose a reference-free evaluation framework built on five binary metrics: rationality, comprehensiveness, non-redundancy, authenticity, and logicality. LLM-based evaluation reaches a Cohen's Kappa of 0.849 against human judgments, and the experiments indicate that LLMs can support both generating and evaluating commit messages.
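To make "Conventional Commits-compliant" concrete, here is a minimal sketch of a header check in Python. It is not the authors' filtering pipeline: the function name `parse_ccs_header`, the regular expression, and the list of accepted types (the commonly used Angular-style set) are assumptions for illustration, since the summary does not state the exact label set used in CommitSuite.

```python
import re

# ASSUMPTION: commonly used Conventional Commits types (Angular convention).
# The exact type set labeled in CommitSuite is not given in the text.
CCS_TYPES = (
    "feat", "fix", "docs", "style", "refactor",
    "perf", "test", "build", "ci", "chore", "revert",
)

# Header shape: type, optional "(scope)", optional "!", then ": description".
CCS_HEADER = re.compile(
    r"^(?P<type>[a-z]+)"
    r"(?:\((?P<scope>[^()\r\n]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


def parse_ccs_header(message):
    """Parse the first line of a commit message.

    Returns a dict with type, scope, breaking, and description,
    or None when the header does not follow the CCS shape.
    """
    lines = message.splitlines()
    if not lines:
        return None
    match = CCS_HEADER.match(lines[0])
    if match is None or match.group("type") not in CCS_TYPES:
        return None
    return {
        "type": match.group("type"),
        "scope": match.group("scope"),
        "breaking": match.group("breaking") is not None,
        "description": match.group("description"),
    }


if __name__ == "__main__":
    for msg in ("fix(parser): handle empty input", "Update README"):
        print(repr(msg), "->", parse_ccs_header(msg))
```

A check like this only looks at the header line. Whether the declared type actually matches the code change is the classification problem the benchmark targets, and a regular expression cannot decide it.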
📝 Abstract
High-quality commit messages are critical for maintaining software projects, yet ensuring their consistency and informativeness remains a practical challenge. While the Conventional Commits Specification (CCS) provides a structured format for commit messages, research on CCS-based commit classification and commit message generation (CMG) is limited by the absence of large-scale benchmarks, semantic annotations, and reliable evaluation methods. In this paper, we introduce CommitSuite, a benchmark comprising 63,533 CCS-compliant commits from 243 open-source repositories across seven programming languages. Each commit is labeled with its CCS type and enriched with AST-level code changes, along with LLM-assisted semantic annotations that capture the "what" and "why" behind the change. To evaluate CMG systems, we propose a reference-free framework based on five binary metrics: rationality, comprehensiveness, non-redundancy, authenticity, and logicality, enabling semantic-level assessment without relying on human-written references. Our experiments show that LLMs can effectively support both generation and evaluation, with evaluation achieving 0.849 Cohen's Kappa agreement against human judgments. CommitSuite offers a unified resource for structured commit understanding and facilitates reproducible research on commit classification and generation.
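For readers unfamiliar with the agreement statistic, the sketch below computes Cohen's Kappa between two raters on binary verdicts, which is the kind of comparison behind the reported 0.849. The label lists are made-up toy data, not results from the paper, and `cohens_kappa` is a hypothetical helper written from the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each rater's marginal rate, summed over labels.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # TOY DATA (assumption): 1 = metric satisfied, 0 = not satisfied,
    # for one binary metric judged on ten generated commit messages.
    human = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
    llm = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
    print(round(cohens_kappa(human, llm), 3))
```

On this toy data the raters agree on 9 of 10 items (p_o = 0.9) and chance agreement is 0.5, so kappa comes out at about 0.8. Because kappa discounts chance agreement, it is a stricter measure than raw accuracy when one verdict dominates.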