GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git

📅 2025-05-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing software engineering (SE) AI agent benchmarks (e.g., SWE-bench) overlook critical developer workflows, particularly version control system (VCS) operations in Git. Method: The authors introduce GitGoodBench, the first Git-centric AI agent benchmark, covering three core Git scenarios mined from permissively licensed open-source Python, Java, and Kotlin repositories and organized into three datasets: a comprehensive evaluation suite (900 samples), a rapid-prototyping subset (120 samples), and a training corpus (17,469 samples). Contribution/Results: A GPT-4o baseline equipped with custom Git tools achieves only a 21.11% overall solve rate on the prototyping subset, exposing a substantial VCS capability gap. The benchmark provides a reproducible, quantitative evaluation standard for SE agents, enabling rigorous assessment of tool-augmented reasoning on developer workflows beyond programming.

📝 Abstract
Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this issue, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on VCS tasks. GitGoodBench covers three core Git scenarios extracted from permissive open-source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11% solve rate overall. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE agents that go beyond mere programming.
Problem

Research questions and friction points this paper is trying to address.

Evaluating AI agent performance on Git tasks
Addressing gaps in current SE benchmarks for VCS operations
Providing datasets for Git scenario evaluation and training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel benchmark for evaluating agent performance on Git tasks
Covers three core Git scenarios
Uses GPT-4o with custom tools
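The paper does not spell out the interface of these custom tools; a minimal sketch of how an agent-facing Git tool wrapper might look, where the names `git_tool` and `runner` are hypothetical and not from the paper:

```python
import subprocess

def git_tool(args, cwd=".", runner=subprocess.run):
    """Run a git subcommand on behalf of an agent.

    Returns (exit_code, output): stdout on success, stderr on failure,
    so the agent sees a single text observation either way.
    The `runner` parameter is injectable to allow testing without a
    real repository; it defaults to subprocess.run.
    """
    proc = runner(["git", *args], cwd=cwd, capture_output=True, text=True)
    output = proc.stdout if proc.returncode == 0 else proc.stderr
    return proc.returncode, output.strip()
```

An agent loop would then map a model-emitted action such as `git status` to `git_tool(["status"], cwd=repo_path)` and feed the returned text back as the next observation.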
Tobias Lindenbauer
JetBrains Research, School of Computation, Information and Technology, Technical University of Munich
Egor Bogomolov
JetBrains Research
machine learning for software engineering
Yaroslav Zharov
Researcher @ JetBrains
Deep Learning, AI Agents