SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

πŸ“… 2026-03-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing code repair benchmarks primarily focus on static, one-off tasks, making them ill-suited for evaluating agents’ capabilities in long-term software maintenance. To address this limitation, this work proposes SWE-CI, the first repository-level benchmark grounded in continuous integration cycles. Leveraging historical commit data from real-world open-source projects, SWE-CI constructs an evaluation framework comprising 100 long-term evolution tasks, each spanning an average of 233 days and 71 consecutive commits. This benchmark shifts the evaluation focus from static functional correctness to dynamic maintainability, systematically assessing agents’ ability to preserve code quality across realistic, iterative software development scenarios through multi-round analysis and coding tasks.


πŸ“ Abstract
Large language model (LLM)-powered agents have demonstrated strong capabilities in automating software engineering tasks such as static bug fixing, as evidenced by benchmarks like SWE-bench. However, in the real world, the development of mature software is typically predicated on complex requirement changes and long-term feature iterations, a process that static, one-shot repair paradigms fail to capture. To bridge this gap, we propose **SWE-CI**, the first repository-level benchmark built upon the Continuous Integration loop, aiming to shift the evaluation paradigm for code generation from static, short-term *functional correctness* toward dynamic, long-term *maintainability*. The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository. SWE-CI requires agents to systematically resolve these tasks through dozens of rounds of analysis and coding iterations, and provides valuable insights into how well agents can sustain code quality throughout long-term evolution.
Problem

Research questions and friction points this paper is trying to address.

code maintainability
continuous integration
software evolution
LLM agents
long-term code quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

SWE-CI
Continuous Integration
Code Maintainability
LLM-powered Agents
Repository-level Benchmark