SWE Context Bench: A Benchmark for Context Learning in Coding

πŸ“… 2026-02-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing benchmarks for programming agents lack evaluation of cross-task experience reuse. This work proposes SWE-ContextBench, an extension of SWE-Bench Lite that constructs context-sharing sequences comprising 300 base tasks and 99 related tasks, thereby enabling the first systematic modeling and assessment of agents' ability to accumulate, retrieve, and apply experience within real-world codebases. By leveraging authentic dependency relationships between GitHub issues and pull requests, the benchmark supports both oracle-guided and autonomous retrieval mechanisms and allows comparison between full execution trajectories and summarized representations. Experiments demonstrate that well-summarized historical experience significantly improves accuracy on challenging tasks while substantially reducing runtime and token consumption; in contrast, unfiltered experience yields limited gains or even degrades performance.

πŸ“ Abstract
Large language models are increasingly used as programming agents for repository-level software engineering tasks. While recent benchmarks evaluate correctness in realistic codebases, they largely treat tasks as independent and do not assess whether agents can reuse experience across related problems. As a result, the ability of agents to accumulate, retrieve, and apply prior experience, as well as the efficiency gains from such reuse, remains difficult to measure. We introduce SWE-ContextBench, a benchmark designed to explicitly evaluate experience reuse in programming agents. Built on SWE-Bench Lite, SWE-ContextBench augments 300 base tasks with 99 related tasks derived from real dependency and reference relationships among GitHub issues and pull requests, forming task sequences with shared context. The benchmark evaluates agents along three complementary dimensions: prediction accuracy, time efficiency, and cost efficiency. Using SWE-ContextBench, we study multiple experience reuse settings, including oracle-guided and autonomous retrieval, as well as full execution trajectories and compact summaries. Our results show that correctly selected summarized experience improves resolution accuracy and substantially reduces runtime and token cost, particularly on harder tasks. In contrast, unfiltered or incorrectly selected experience provides limited or negative benefits. These findings highlight the importance of experience representation and retrieval quality, and position SWE-ContextBench as a principled benchmark for studying experience reuse in programming agents.
Problem

Research questions and friction points this paper is trying to address.

context learning
experience reuse
programming agents
software engineering
benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

experience reuse
programming agents
context learning
benchmark
code summarization
Jared Zhu
Independent Researcher
Minhao Hu
University of Oxford
Junde Wu
University of Oxford
Artificial Intelligence Β· AI for Medical Science