RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing code repair benchmarks are largely confined to single-file, single-issue scenarios, which makes them inadequate for evaluating agents on real-world software development, where work typically involves cross-file coordination, long-horizon planning, and multiple targets per iteration. To address this gap, this work proposes RoadmapBench, which it presents as the first benchmark designed to assess long-horizon, multi-target development at realistic engineering scale. The benchmark comprises 115 version-upgrade tasks drawn from 17 open-source repositories in five programming languages. Each task uses a fine-grained roadmap instruction to guide the agent toward the target functionality, and an automated evaluation framework assesses the result. Even the strongest model evaluated, Claude-Opus-4.7, completes only 39.1% of the tasks, in stark contrast to results on existing repair benchmarks, which underscores that long-horizon software development remains a substantial unresolved challenge.

📝 Abstract

Coding agents are increasingly deployed in real software development, where a single version iteration requires months of coordinated work across many files. However, most existing benchmarks focus on single-issue bug fixes from Python repositories with coarse pass/fail evaluation outcomes, and thus fail to capture long-horizon, multi-target development at real engineering scale. To address this gap, we present RoadmapBench, a benchmark of 115 long-horizon coding tasks grounded in real open-source version upgrades across 17 repositories and 5 programming languages. Each task places the agent on a source-version code snapshot and provides a multi-target roadmap instruction requiring it to implement the functionality introduced in the target version, with a median modification of 3,700 lines across 51 files. We conduct a systematic evaluation of thirteen frontier models and find that even the strongest, Claude-Opus-4.7, resolves only 39.1% of tasks, while the weakest achieves merely 5.2%. This stands in stark contrast to results on existing bug-fix benchmarks and suggests that long-horizon software development remains a largely unsolved problem.
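
To put the reported percentages in concrete terms, the sketch below converts them into approximate task counts out of 115. It assumes the resolve rate is simply resolved tasks divided by total tasks; the abstract does not spell out the scoring rule, and the helper names `implied_task_count` and `as_percent` are hypothetical, not part of RoadmapBench.

```python
# Assumption: resolve rate = resolved tasks / total tasks.
# The helper names here are illustrative only.
TOTAL_TASKS = 115  # benchmark size stated in the abstract


def implied_task_count(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks consistent with a reported percentage."""
    return round(rate_percent / 100 * total)


def as_percent(resolved: int, total: int = TOTAL_TASKS) -> float:
    """Resolve rate as a percentage, rounded to one decimal place."""
    return round(100 * resolved / total, 1)


best = implied_task_count(39.1)   # strongest model in the abstract
worst = implied_task_count(5.2)   # weakest model in the abstract

print(f"strongest: about {best} of {TOTAL_TASKS} tasks ({as_percent(best)}%)")
print(f"weakest:   about {worst} of {TOTAL_TASKS} tasks ({as_percent(worst)}%)")
```

Under that assumption, the two rates correspond to roughly 45 and 6 resolved tasks, and both counts round back to the percentages quoted in the abstract.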

Problem

Research questions and friction points this paper is trying to address.

long-horizon software development

agentic coding

version upgrades

multi-target development

software engineering benchmark

Innovation

Methods, ideas, or system contributions that make the work stand out.

long-horizon software development

agentic coding

version upgrade benchmark