ReleaseEval: A Benchmark for Evaluating Language Models in Automated Release Note Generation

📅 2025-11-04

🤖 AI Summary

Existing research on automated release note generation suffers from unclear dataset licensing, poor reproducibility, and coarse-grained task design that relies mainly on commit messages while neglecting commit structure and code changes. To address these issues, this paper introduces ReleaseEval, an openly licensed, reproducible benchmark of 94,987 release notes from 3,369 repositories across six programming languages. It defines three tasks at increasing input granularity: commit2sum (generating release notes from commit messages), tree2sum (adding commit tree structures), and diff2sum (adding fine-grained code diffs). Automated and human evaluations show that large language models consistently outperform traditional baselines on all three tasks, with substantial gains on tree2sum, yet still struggle on diff2sum, which points to difficulty in abstracting from long code diffs.

📝 Abstract

Automated release note generation addresses the challenge of documenting frequent software updates, where manual efforts are time-consuming and prone to human error. Although recent advances in language models further enhance this process, progress remains hindered by dataset limitations, including the lack of explicit licensing and limited reproducibility, and incomplete task design that relies mainly on commit messages for summarization while overlooking fine-grained contexts such as commit hierarchies and code changes. To fill this gap, we introduce ReleaseEval, a reproducible and openly licensed benchmark designed to systematically evaluate language models for automated release note generation. ReleaseEval comprises 94,987 release notes from 3,369 repositories across 6 programming languages, and supports three task settings with three levels of input granularity: (1) commit2sum, which generates release notes from commit messages; (2) tree2sum, which incorporates commit tree structures; and (3) diff2sum, which leverages fine-grained code diffs. Both automated and human evaluations show that large language models consistently outperform traditional baselines across all tasks, achieving substantial gains on tree2sum, while still struggling on diff2sum. These findings highlight LLMs' proficiency in leveraging structured information while revealing challenges in abstracting from long code diffs.
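
To make the three input granularities concrete, here is a minimal Python sketch of how the model input for each task setting might be assembled. Everything in it is an assumption for illustration: the `Commit` fields, the function names, the indentation-based tree encoding, and the character budget for diffs are hypothetical and are not taken from ReleaseEval's actual data format or prompts.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Commit:
    """Hypothetical commit record; not ReleaseEval's real schema."""
    sha: str
    message: str
    parent: Optional[str] = None  # assumed link to the parent node in the commit tree
    diff: str = ""                # assumed unified-diff text for this commit


def commit2sum_input(commits: List[Commit]) -> str:
    """Coarsest setting: a flat list of commit messages."""
    return "\n".join(f"- {c.message}" for c in commits)


def tree2sum_input(commits: List[Commit]) -> str:
    """Adds structure: indent each message by its depth in the commit tree."""
    by_sha = {c.sha: c for c in commits}

    def depth(c: Commit) -> int:
        d = 0
        # Walk up parent links; the bound guards against accidental cycles.
        while c.parent in by_sha and d < len(by_sha):
            c = by_sha[c.parent]
            d += 1
        return d

    return "\n".join(f"{'  ' * depth(c)}- {c.message}" for c in commits)


def diff2sum_input(commits: List[Commit], max_diff_chars: int = 2000) -> str:
    """Finest setting: each message followed by its (truncated) code diff."""
    blocks = []
    for c in commits:
        block = f"- {c.message}"
        if c.diff:
            # Truncation is an assumed way to keep long diffs within a budget.
            block += "\n" + c.diff[:max_diff_chars]
        blocks.append(block)
    return "\n".join(blocks)
```

Passing the same list of commits to each function yields progressively richer inputs. The truncation step in `diff2sum_input` hints at why that setting is the hardest: long diffs have to be cut or compressed before a model can summarize them.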

Problem

Research questions and friction points this paper is trying to address.

Evaluating language models for automated release note generation

Addressing dataset limitations and incomplete task design

Systematic assessment across multiple input granularity levels

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the ReleaseEval benchmark for automated release note generation

Supports three tasks at increasing input granularity, from commit messages to code diffs

Evaluates LLMs using structured commit trees and code changes
