CommitSuite: A Comprehensive Benchmark for Commit Classification and Message Generation

📅 2026-05-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses key limitations in existing research on commit classification and commit message generation: the absence of large-scale, semantically annotated benchmark datasets and of reliable reference-free evaluation methods. To this end, the authors introduce CommitSuite, a large-scale multilingual benchmark of 63,533 commits that comply with the Conventional Commits Specification (CCS), drawn from 243 open-source repositories across seven programming languages. Each commit is labeled with its CCS type and enriched with abstract syntax tree (AST)-level code change representations and LLM-assisted semantic annotations capturing “what” changed and “why”. The authors also propose a reference-free evaluation framework built on five binary metrics: rationality, comprehensiveness, non-redundancy, authenticity, and logicality. LLM-based evaluation under this framework reaches a Cohen’s Kappa of 0.849 against human judgments, indicating that large language models can support both the generation and the evaluation of commit messages.
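To make "CCS-compliant" concrete, here is a minimal sketch of checking a commit header against the `<type>[(scope)][!]: <description>` shape defined by the Conventional Commits Specification. It is an illustration only, not the filtering pipeline used to build CommitSuite: the names `HEADER_RE`, `CommitHeader`, and `parse_header` are hypothetical, and the sketch validates the header line only (not the body or footers).

```python
import re
from typing import NamedTuple, Optional

# Header shape from the Conventional Commits Specification:
#   <type>[optional (scope)][optional !]: <description>
# Assumption: types are plain ASCII words; which type labels the benchmark
# actually accepts is not stated in this summary.
HEADER_RE = re.compile(
    r"^(?P<type>[A-Za-z]+)"
    r"(?:\((?P<scope>[^()\r\n]+)\))?"
    r"(?P<breaking>!)?"
    r": (?P<description>\S.*)$"
)


class CommitHeader(NamedTuple):
    commit_type: str
    scope: Optional[str]
    breaking: bool
    description: str


def parse_header(message: str) -> Optional[CommitHeader]:
    """Parse the first line of a commit message; return None if it does not match."""
    lines = message.splitlines()
    if not lines:
        return None
    match = HEADER_RE.match(lines[0])
    if match is None:
        return None
    return CommitHeader(
        commit_type=match["type"].lower(),
        scope=match["scope"],
        breaking=match["breaking"] is not None,
        description=match["description"],
    )
```

For example, `parse_header("feat(parser)!: add array support")` yields type `feat`, scope `parser`, and the breaking-change flag set, while a free-form subject such as `"Update README"` returns `None`.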

📝 Abstract

High-quality commit messages are critical for maintaining software projects, yet ensuring their consistency and informativeness remains a practical challenge. While the Conventional Commits Specification (CCS) provides a structured format for commit messages, research on CCS-based commit classification and commit message generation (CMG) is limited by the absence of large-scale benchmarks, semantic annotations, and reliable evaluation methods. In this paper, we introduce CommitSuite, a benchmark comprising 63,533 CCS-compliant commits from 243 open-source repositories across seven programming languages. Each commit is labeled with its CCS type and enriched with AST-level code changes, along with LLM-assisted semantic annotations that capture the "what" and "why" behind the change. To evaluate CMG systems, we propose a reference-free framework based on five binary metrics: rationality, comprehensiveness, non-redundancy, authenticity, and logicality, enabling semantic-level assessment without relying on human-written references. Our experiments show that LLMs can effectively support both generation and evaluation, with evaluation achieving 0.849 Cohen's Kappa agreement against human judgments. CommitSuite offers a unified resource for structured commit understanding and facilitates reproducible research on commit classification and generation.
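The reported agreement figure is a Cohen's Kappa, which corrects raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two raters' binary verdicts (say, a human and an LLM judge on one metric). It assumes labels are encoded as 0/1; how the paper pools the five metrics into the single 0.849 figure is not stated in the abstract, so the function name `cohens_kappa` and the toy labels are illustrative only.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Cohen's Kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items where the two labels match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over all labels.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Toy verdicts (made up for illustration, not data from the paper).
human = [1, 1, 0, 0, 1, 0, 1, 1]
llm = [1, 1, 0, 1, 1, 0, 1, 0]
print(round(cohens_kappa(human, llm), 3))
```

In the toy data the raters agree on 6 of 8 items (p_o = 0.75), yet kappa is noticeably lower because both raters say "1" most of the time, so a good share of that agreement is expected by chance.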

Problem

Research questions and friction points this paper is trying to address.

commit classification

commit message generation

Conventional Commits Specification

benchmark

semantic evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

CommitSuite

Conventional Commits

commit message generation