NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

📅 2025-12-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks for coding agents inadequately evaluate long-horizon software system construction: the ability to maintain coherent planning and execution over hours to days in realistic repository-scale development.
Method: We introduce the first evaluation benchmark for *long-horizon repository generation*, requiring models to autonomously design the architecture, manage dependencies, implement multiple modules, and deliver an installable Python package, starting solely from natural language requirements and an empty workspace. We formalize and quantify this capability, identify critical failure modes (e.g., premature termination, global incoherence, fragile cross-file dependencies), and build an end-to-end executable test environment grounded in real-world Python ecosystems, featuring installation validation, multi-file consistency checking, coverage-driven assessment, and step-by-step execution tracing.
Results: Experiments show that state-of-the-art models achieve below a 40% average test pass rate, underscoring long-horizon reasoning as a fundamental bottleneck for autonomous coding agents.
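The evaluation pipeline described above can be sketched as a minimal, hypothetical harness. The summary does not specify the benchmark's actual interface; `install_ok` and `pass_rate` are illustrative names, and the editable-install check is one plausible way to implement the "installation validation" step:

```python
import subprocess
import sys
from pathlib import Path


def install_ok(repo: Path) -> bool:
    """Installation validation (hypothetical): try to pip-install the
    agent-generated package in editable mode and treat a zero exit
    code as success. The benchmark's real harness may differ."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", "-e", str(repo)],
        capture_output=True,
    )
    return result.returncode == 0


def pass_rate(passed: int, total: int) -> float:
    """Fraction of hidden tests passed for one repository; a repo
    with no runnable tests (e.g. not installable) scores 0.0."""
    return passed / total if total else 0.0
```

On a metric like this, the paper reports that even the strongest agents average below a 0.40 pass rate across repositories.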

📝 Abstract
Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.
Problem

Research questions and friction points this paper is trying to address.

Evaluates long-horizon repository generation in coding agents
Assesses autonomous design and multi-module implementation from requirements
Identifies failure modes like coherence loss and inadequate planning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for long-horizon repository generation evaluation
Agents autonomously design architecture from natural language requirements
Reveals failure modes like premature termination and coherence loss
Jingzhe Ding
ByteDance Seed China
Shengda Long
M-A-P
Changxin Pu
2077AI
Huan Zhou
Northwestern Polytechnical University
Mobile Edge Computing · Federated Learning · Mobile Social Networks · VANETs · Data Offloading
Hongwan Gao
Nanjing University
Xiang Gao
Peking University
Chao He
Beijing University of Posts and Telecommunications
Yue Hou
Beihang University
Fei Hu
ByteDance Seed China
Zhaojian Li
Red Cedar Distinguished Associate Professor, Michigan State University
Controls · Learning · Robotics · Connected Vehicles · Smart Agriculture
Weiran Shi
2077AI
Zaiyuan Wang
ByteDance
AI · LLM · Function Call · Agent
Daoguang Zan
ByteDance Seed
Large Language Model · Software Engineering · Coding Agent
Chenchen Zhang
Peking University
Xiaoxu Zhang
Beijing University of Posts and Telecommunications
Qizhi Chen
PhD Candidate of Zhejiang University
Multimodal Reasoning · Embodied AI · 3D Vision
Xianfu Cheng
ByteDance Seed China
Bo Deng
M-A-P
Qingshui Gu
2077AI
Kai Hua
Humanlaya Data
Juntao Lin
Nanjing University
Pai Liu
University of Rochester
AI4Healthcare · Web Agent · LLM
Mingchen Li
Beijing University of Posts and Telecommunications
Xuanguang Pan
Beihang University
Zifan Peng
Ph.D. Candidate at HKUST(GZ)
DeFi · Trustworthy AI