🤖 AI Summary
This work addresses the practical challenge of inconsistent code and documentation changes during software maintenance. We introduce CodeDocAlign, the first large-scale, real-world, GitHub commit-driven dataset for fine-grained code–docstring co-evolution. Methodologically, we propose the first automated mining framework integrating Git history analysis, static parsing, and heuristic filtering to precisely extract semantically aligned bidirectional change pairs. We further define multiple realistic alignment evaluation tasks. Benchmarking with state-of-the-art open-weight LLMs—including Llama-3.1 405B and Mixtral 8×22B—reveals severe deficiencies in both code-to-docstring and docstring-to-code inference. CodeDocAlign provides a reliable training resource and standardized evaluation benchmark for modeling code–documentation co-evolution, thereby filling a critical gap in maintenance-oriented AI research.
📝 Abstract
A central task in software maintenance is understanding and developing code changes. Given a natural language description of the desired new operation of a function, an agent (human or AI) might be asked to generate the set of edits to that function that implements the desired new operation; likewise, given a set of edits to a function, an agent might be asked to generate an updated description of that function's new workings. There is thus an incentive to train neural models for change-related tasks. Motivated by this, we offer a new, "natural", large dataset of coupled changes to code and documentation mined from actual high-quality GitHub projects, where each sample represents a single commit in which the code and the associated docstring were changed together. We present the methodology for gathering the dataset, along with some sample, challenging (but realistic) tasks for which our dataset provides opportunities for both learning and evaluation. We find that current models (specifically Llama-3.1 405B and Mixtral 8×22B) find these maintenance-related tasks challenging.
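The core co-change criterion described above — a single commit in which both a function's docstring and its code changed — could be checked, in a minimal sketch, with Python's `ast` module. The function names and the AST-based comparison here are illustrative assumptions, not the paper's actual mining pipeline:

```python
import ast


def docstring_and_body(source: str):
    """Split a function's source into (docstring, dump of body without the docstring)."""
    fn = ast.parse(source).body[0]  # assumes source holds exactly one function
    doc = ast.get_docstring(fn)
    body = fn.body[1:] if doc is not None else fn.body
    return doc, ast.dump(ast.Module(body=body, type_ignores=[]))


def is_co_change(old_src: str, new_src: str) -> bool:
    """True only when BOTH the docstring and the code body changed between versions."""
    old_doc, old_body = docstring_and_body(old_src)
    new_doc, new_body = docstring_and_body(new_src)
    return old_doc != new_doc and old_body != new_body


old = '''
def area(r):
    """Return the area of a circle."""
    return 3.14 * r * r
'''

new = '''
def area(r):
    """Return the area of a circle of radius r, using math.pi."""
    import math
    return math.pi * r * r
'''
```

Here `is_co_change(old, new)` is true, so this pair would qualify as a co-evolution sample, whereas a commit touching only the code or only the docstring would be filtered out. A real mining framework would additionally walk Git history and apply heuristic quality filters, as the summary describes.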