AI Summary
Current evaluations of code agents are largely confined to simple, single-repository bug fixes, failing to address real-world software development challenges such as cross-repository reasoning, domain specialization, dependency migration, and whole-repository code generation. To bridge this gap, this work proposes BeyondSWE, the first systematic benchmark that extends beyond single-repository repair through four task categories grounded in 500 real-world instances. We also introduce SearchSWE, a framework that integrates deep search with code generation to emulate the interleaved search-and-reasoning workflow of human developers. Experiments reveal that state-of-the-art models achieve success rates below 45% on BeyondSWE and exhibit significant performance instability. Moreover, search augmentation yields highly task-dependent benefits and can even degrade performance, exposing critical limitations of current agents in realistic development scenarios.
Abstract
Current benchmarks for code agents primarily assess narrow, repository-specific fixes, overlooking critical real-world challenges such as cross-repository reasoning, domain-specialized problem solving, dependency-driven migration, and full-repository generation. To address this gap, we introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes, resolution scope and knowledge scope, using 500 real-world instances across four distinct settings. Experimental results reveal a significant capability gap: even frontier models plateau below 45% success, and no single model performs consistently across task types. To systematically investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. Our experiments show that search augmentation yields inconsistent gains and can, in some cases, degrade performance, highlighting the difficulty of emulating developer-like workflows that interleave search and reasoning during coding tasks. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
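To make the "interleaved search and reasoning" workflow that SearchSWE targets more concrete, below is a minimal sketch of how such an agent loop could be structured. This is not the paper's implementation: the callables `propose`, `search`, and `run_tests`, the `AgentStep` record, and the stopping condition are all illustrative assumptions standing in for an LLM policy, an external retriever, and a test sandbox supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AgentStep:
    """One turn of the interleaved loop: either a search query or a code edit."""
    kind: str              # "search" or "edit" (assumed action space)
    content: str           # query text or proposed patch
    observation: str = ""  # retrieved knowledge or test output

def interleaved_search_and_code(
    issue: str,
    propose: Callable[[str, List[AgentStep]], AgentStep],  # e.g. an LLM policy
    search: Callable[[str], str],                          # e.g. a web/doc retriever
    run_tests: Callable[[str], str],                       # e.g. a sandboxed test runner
    max_steps: int = 10,
) -> List[AgentStep]:
    """Alternate between external search and code edits until the tests pass
    or the step budget runs out. All components are placeholders; only the
    control flow (search interleaved with patching) is the point of the sketch."""
    trajectory: List[AgentStep] = []
    for _ in range(max_steps):
        step = propose(issue, trajectory)        # policy decides: search or edit
        if step.kind == "search":
            step.observation = search(step.content)      # pull in external knowledge
            trajectory.append(step)
        else:
            step.observation = run_tests(step.content)   # apply patch, run tests
            trajectory.append(step)
            if "PASSED" in step.observation:              # assumed success signal
                break
    return trajectory
```

The point of the sketch is the finding the abstract reports: because the search step feeds directly into subsequent edits, a retrieval result that is irrelevant or misleading propagates into the patch, which is one plausible way search augmentation can degrade rather than improve task success.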