AgentPack: A Dataset of Code Changes, Co-Authored by Agents and Humans

πŸ“… 2025-09-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing code-editing models rely on noisy human commit data: messages are often terse, commits commingle unrelated edits, and many come from simple rule-based bots. Method: We introduce AgentPack, a large-scale dataset of human-AI collaborative code edits, comprising 1.3 million code modifications co-authored by Claude Code, OpenAI Codex, and Cursor Agent across public GitHub projects up to mid-August 2025. Agent-generated commit messages articulate intent and rationale in detail, and changes that land in public repositories are implicitly quality-filtered by maintainers, who discard low-quality contributions. We describe the identification and curation pipeline, quantify adoption trends of these agents, and analyze the structural properties of the edits. Contribution/Results: Models fine-tuned on AgentPack can outperform baselines trained on prior human-only commit corpora, demonstrating the value of human-AI collaborative data for training code-editing models.

πŸ“ Abstract
Fine-tuning large language models for code editing has typically relied on mining commits and pull requests. The working hypothesis has been that commit messages describe human intent in natural language, and patches to code describe the changes that implement that intent. However, much of the previously collected data is noisy: commit messages are terse, human-written commits commingle several unrelated edits, and many commits come from simple, rule-based bots. The recent adoption of software engineering agents changes this landscape. Code changes co-authored by humans and agents tend to be more narrowly scoped and focused on clearer goals. Their commit messages, generated by LLMs, articulate intent and rationale in much greater detail. Moreover, when these changes land in public repositories, they are implicitly filtered by humans: maintainers discard low-quality commits to their projects. We present AgentPack, a corpus of 1.3M code edits co-authored by Claude Code, OpenAI Codex, and Cursor Agent across public GitHub projects up to mid-August 2025. We describe the identification and curation pipeline, quantify adoption trends of these agents, and analyze the structural properties of the edits. Finally, we show that models fine-tuned on AgentPack can outperform models trained on prior human-only commit corpora, highlighting the potential of using public data from software engineering agents to train future code-editing models.
Problem

Research questions and friction points this paper is trying to address.

Mining training data for code editing from noisy commit histories
Filtering low-quality, bot-generated, or multi-purpose commits
Improving code-editing models beyond human-only commit corpora
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using code edits co-authored by humans and coding agents (Claude Code, OpenAI Codex, Cursor Agent)
Leveraging implicit quality filtering by project maintainers
Fine-tuning code-editing models on the curated 1.3M-edit AgentPack corpus
πŸ”Ž Similar Papers
No similar papers found.