A Benchmark for Evaluating Repository-Level Code Agents with Intermediate Reasoning on Feature Addition Task

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code agent benchmarks rely solely on black-box evaluation of final test correctness, making it difficult to diagnose reasoning processes or failure causes. To address this limitation, this work proposes RACE-bench, a benchmark of 528 real-world, repository-level feature addition tasks. RACE-bench is the first to pair each task with structured intermediate reasoning annotations and executable patch validation, and it provides a dual-track evaluation framework that jointly assesses patch correctness and reasoning quality. Evaluations on real open-source data with multi-dimensional automated metrics show that leading agents achieve overall solve rates of 29%–70%, with a notable degradation in reasoning quality during the implementation phase. These findings underscore the need for fine-grained reasoning evaluation and demonstrate the effectiveness of the proposed benchmark.
📝 Abstract
Repository-level code agents have shown strong promise in real-world feature addition tasks, making reliable evaluation of their capabilities increasingly important. However, existing benchmarks primarily evaluate these agents as black boxes based on final test correctness, providing limited insight into how they reason and where failures arise. To address this limitation, we introduce RACE-bench, a reasoning-augmented benchmark for evaluating code agents on repository-level feature addition tasks. RACE-bench contains 528 real-world feature addition instances from 12 open-source repositories. Each instance is paired with executable patch verification and structured intermediate reasoning ground truth covering issue understanding, file localization, implementation tasks, and step decomposition. Based on this design, we introduce a dual-track evaluation framework that jointly measures patch correctness and intermediate reasoning quality. We evaluate three representative repository-level code agents on RACE-bench. On the full benchmark, Resolved Rates range from 29% to 70% across different agents. Our reasoning-level analysis further shows that while current agents perform well at understanding high-level intent, their performance degrades substantially when translating intent into concrete implementation steps. We also find that apply-success but test-fail cases exhibit lower reasoning recall (35.7% decrease) and higher over-prediction (94.1% increase) compared to successful cases. These findings highlight the importance of evaluating repository-level code agents beyond final patch correctness by examining the quality of their reasoning processes.
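The reasoning-level metrics the abstract reports (recall over ground-truth reasoning items, and over-prediction of items not grounded in the annotation) can be sketched as a simple set comparison. This is a minimal illustration only: the function name, the representation of reasoning items as string sets, and the exact-match rule are assumptions, not the paper's actual scoring implementation.

```python
# Hypothetical sketch of scoring one instance's intermediate reasoning
# (e.g. file localization) against RACE-bench-style ground truth.
# Exact-match set overlap is an assumption; the benchmark's matcher may differ.

def reasoning_scores(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Return recall over gold reasoning items and an over-prediction rate
    (fraction of predicted items with no grounding in the annotation)."""
    if not gold:
        raise ValueError("gold annotation set must be non-empty")
    matched = predicted & gold
    recall = len(matched) / len(gold)
    over_prediction = (len(predicted - gold) / len(predicted)) if predicted else 0.0
    return {"recall": recall, "over_prediction": over_prediction}

# Example: the agent localizes both correct files plus one spurious file.
gold = {"src/core/api.py", "src/core/router.py"}
pred = {"src/core/api.py", "src/core/router.py", "src/utils/log.py"}
print(reasoning_scores(pred, gold))
```

Under this reading, the paper's apply-success-but-test-fail finding corresponds to lower `recall` and higher `over_prediction` than in fully resolved cases.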
Problem

Research questions and friction points this paper is trying to address.

repository-level code agents
feature addition
intermediate reasoning
evaluation benchmark
reasoning quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

repository-level code agents
intermediate reasoning
feature addition
evaluation benchmark
reasoning-augmented evaluation