BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing?

📅 2026-03-03
🤖 AI Summary
Current evaluations of code agents are largely confined to simple, single-repository bug fixes, failing to address real-world software development challenges such as cross-repository reasoning, domain specialization, dependency migration, and whole-repository code generation. To bridge this gap, this work proposes BeyondSWE, the first systematic benchmark that extends beyond single-repository repair through four task categories grounded in 500 real-world instances. We also introduce SearchSWE, a framework that integrates deep search with code generation to emulate the interleaved search-and-reasoning workflow of human developers. Experiments reveal that state-of-the-art models achieve success rates below 45% on BeyondSWE and exhibit significant performance instability. Moreover, search augmentation yields highly task-dependent benefits and can even degrade performance, exposing critical limitations of current agents in realistic development scenarios.

๐Ÿ“ Abstract
Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope - using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can in some cases degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
Problem

Research questions and friction points this paper is trying to address.

code agent
cross-repository reasoning
domain-specialized problem solving
dependency-driven migration
full-repository generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

BeyondSWE
code agent benchmark
cross-repository reasoning
search-augmented coding
SearchSWE
🔎 Similar Papers
Guoxin Chen
1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Independent Researcher; 3 AweAI Team
Fanzhe Meng
1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Independent Researcher; 3 AweAI Team
Jiale Zhao
1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Independent Researcher; 3 AweAI Team
Minghao Li
Beihang University
Natural Language Processing
Daixuan Cheng
Gaoling School of AI, Renmin University of China
LLM Pre-Training, Domain Adaptation, Reasoning
Huatong Song
GSAI, Renmin University of China
Large Language Models
Jie Chen
Renmin University of China
Large Language Models, Natural Language Processing, Reinforcement Learning, Pre-training
Yuzhi Lin
1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Independent Researcher; 3 AweAI Team
Hui Chen
1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Independent Researcher; 3 AweAI Team
Xin Zhao
1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Independent Researcher; 3 AweAI Team
Ruihua Song
Renmin University of China
AI based creation, multi-modality chitchat, natural language understanding, information retrieval, information extraction
Chang Liu
Assistant Professor at Budapest University of Technology and Economics; Research Fellow at SZTAKI
Artificial Intelligence, Machine Vision, Deep Learning, Autonomous Driving, Remote Sensing
Cheng Chen
East China Normal University
Online Learning, Optimization, Numerical Linear Algebra
Kai Jia
MIT
Ji-Rong Wen
Gaoling School of Artificial Intelligence, Renmin University of China
Large Language Model, Web Search, Information Retrieval, Machine Learning