🤖 AI Summary
Existing software engineering (SE) AI agent benchmarks (e.g., SWE-bench) overlook critical collaborative workflows, particularly version control system (VCS) operations with Git. Method: We introduce GitGoodBench, a Git-centric AI agent benchmark that systematically defines a Git task evaluation framework and provides a three-tier dataset—900 evaluation samples, 120 rapid-prototyping samples, and 17,469 training samples—mined from permissively licensed open-source Python, Java, and Kotlin repositories. Contribution/Results: A GPT-4o baseline equipped with custom Git tools achieves only a 21.11% overall solve rate on the prototyping set, exposing a severe VCS capability gap in current agents. The benchmark provides a reproducible, quantitative evaluation standard for end-to-end SE agents, enabling rigorous assessment of tool-augmented reasoning and collaborative workflow understanding.
📝 Abstract
Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this issue, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on VCS tasks. GitGoodBench covers three core Git scenarios extracted from permissive open-source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11% solve rate overall. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE agents that go beyond mere programming.