🤖 AI Summary
This work addresses the practical challenge of inconsistent code and documentation changes during software maintenance. We introduce CodeDocAlign, the first large-scale, real-world, GitHub commit-driven dataset for fine-grained code–docstring co-evolution. Methodologically, we propose the first automated mining framework integrating Git history analysis, static parsing, and heuristic filtering to precisely extract semantically aligned bidirectional change pairs. We further define multiple realistic alignment evaluation tasks. Benchmarking with state-of-the-art open-weight LLMs—including Llama-3.1 405B and Mixtral 8×22B—reveals severe deficiencies in both code-to-docstring and docstring-to-code inference. CodeDocAlign provides a reliable training resource and standardized evaluation benchmark for modeling code–documentation co-evolution, thereby filling a critical gap in maintenance-oriented AI research.
📝 Abstract
A central task in software maintenance is understanding and developing code changes. Given a natural language description of the desired new operation of a function, an agent (human or AI) might be asked to generate the set of edits to that function that implements the desired new operation; likewise, given a set of edits to a function, an agent might be asked to generate an updated description of that function's new workings. There is thus an incentive to train neural models for change-related tasks. Motivated by this, we offer a new, "natural", large dataset of coupled changes to code and documentation mined from actual high-quality GitHub projects, where each sample represents a single commit in which the code and the associated docstring were changed together. We present the methodology for gathering the dataset, along with some sample, challenging (but realistic) tasks for which our dataset provides opportunities for both learning and evaluation. We find that current models (specifically Llama-3.1 405B and Mixtral 8×22B) find these maintenance-related tasks challenging.
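The core co-change criterion described above — a single commit in which both a function's docstring and its code changed — could be checked, in a minimal sketch, with Python's `ast` module. The function names and the AST-based comparison here are illustrative assumptions, not the paper's actual mining pipeline:

```python
import ast


def docstring_and_body(source: str):
    """Split a function's source into (docstring, dump of body without the docstring)."""
    fn = ast.parse(source).body[0]  # assumes source holds exactly one function
    doc = ast.get_docstring(fn)
    body = fn.body[1:] if doc is not None else fn.body
    return doc, ast.dump(ast.Module(body=body, type_ignores=[]))


def is_co_change(old_src: str, new_src: str) -> bool:
    """True only when BOTH the docstring and the code body changed between versions."""
    old_doc, old_body = docstring_and_body(old_src)
    new_doc, new_body = docstring_and_body(new_src)
    return old_doc != new_doc and old_body != new_body


old = '''
def area(r):
    """Return the area of a circle."""
    return 3.14 * r * r
'''

new = '''
def area(r):
    """Return the area of a circle of radius r, using math.pi."""
    import math
    return math.pi * r * r
'''
```

Here `is_co_change(old, new)` is true, so this pair would qualify as a co-evolution sample, whereas a commit touching only the code or only the docstring would be filtered out. A real mining framework would additionally walk Git history and apply heuristic quality filters, as the summary describes.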