🤖 AI Summary
Existing software engineering (SE) AI agent benchmarks (e.g., SWE-bench) overlook critical collaborative workflows, particularly version control system (VCS) operations with Git. Method: We introduce GitGoodBench, a Git-centric AI agent benchmark that systematically defines a Git task evaluation framework and provides a three-tier dataset—900 evaluation samples, 120 rapid-prototyping samples, and 17,469 training samples—mined from permissively licensed open-source Python, Java, and Kotlin repositories. Contribution/Results: A GPT-4o baseline equipped with custom Git tools achieves only a 21.11% overall solve rate on the prototyping set, exposing a severe VCS capability gap in current agents. The benchmark provides a reproducible, quantitative evaluation standard for end-to-end SE agents, enabling rigorous assessment of tool-augmented reasoning and collaborative workflow understanding.
📝 Abstract
Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this issue, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on VCS tasks. GitGoodBench covers three core Git scenarios extracted from permissive open-source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11% solve rate overall. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE agents that go beyond mere programming.