SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

📅 2025-11-08
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing benchmarks predominantly focus on code defect repair, neglecting the harder question of *how to optimize* software performance. Method: This paper introduces SWE-fficiency, the first repository-level performance-optimization benchmark grounded in real-world workloads. It comprises 498 tasks across nine prominent open-source projects (e.g., NumPy, pandas), requiring agents to identify performance bottlenecks and generate efficient patches while preserving functional correctness. Tasks are constructed automatically from expert performance-improving commits via a pipeline that combines keyword filtering, static analysis, test-coverage analysis, and execution-based validation. Contribution/Results: Experiments reveal that state-of-the-art agents achieve less than 0.15x the expert speedup on average, exposing severe limitations in bottleneck localization, cross-function reasoning, and correctness-preserving optimization, and highlighting a fundamental gap in current AI-driven software engineering capabilities.

📝 Abstract
Optimizing the performance of large-scale software repositories demands expertise in code reasoning and software engineering (SWE) to reduce runtime while preserving program correctness. However, most benchmarks emphasize what to fix rather than how to fix code. We introduce SWE-fficiency, a benchmark for evaluating repository-level performance optimization on real workloads. Our suite contains 498 tasks across nine widely used data-science, machine-learning, and HPC repositories (e.g., numpy, pandas, scipy): given a complete codebase and a slow workload, an agent must investigate code semantics, localize bottlenecks and relevant tests, and produce a patch that matches or exceeds expert speedup while passing the same unit tests. To enable this how-to-fix evaluation, our automated pipeline scrapes GitHub pull requests for performance-improving edits, combining keyword filtering, static analysis, coverage tooling, and execution validation to both confirm expert speedup baselines and identify relevant repository unit tests. Empirical evaluation of state-of-the-art agents reveals significant underperformance. On average, agents achieve less than 0.15x the expert speedup: agents struggle in localizing optimization opportunities, reasoning about execution across functions, and maintaining correctness in proposed edits. We release the benchmark and accompanying data pipeline to facilitate research on automated performance engineering and long-horizon software reasoning.
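The abstract's scoring criterion (an agent's patch must match or exceed the expert speedup while passing the same unit tests) can be sketched as a normalized ratio. This is a minimal illustration, not the paper's exact metric: the timing protocol, repeat count, and normalization here are assumptions.

```python
import time

def measure(workload, repeats=5):
    """Best-of-N wall-clock runtime of a zero-argument workload callable."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        workload()
        best = min(best, time.perf_counter() - t0)
    return best

def speedup_ratio(t_before, t_after_agent, t_after_expert):
    """Agent speedup normalized against the expert speedup.

    A score of 1.0 means the agent's patch matched the expert patch;
    the paper reports agents averaging below 0.15 on a ratio of this kind.
    """
    agent_speedup = t_before / t_after_agent
    expert_speedup = t_before / t_after_expert
    return agent_speedup / expert_speedup
```

For example, if the unpatched workload runs in 10 s, an agent patch brings it to 5 s, and the expert patch to 2 s, the agent achieves a 2x speedup against the expert's 5x, scoring 0.4.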
Problem

Research questions and friction points this paper is trying to address.

Evaluating repository-level performance optimization on real workloads
Automating code bottleneck localization and patch generation
Addressing limitations in long-horizon software reasoning and correctness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline that scrapes GitHub pull requests for performance-improving edits
Keyword filtering combined with static analysis to identify candidate commits
Coverage tooling and execution validation to confirm expert speedups and locate relevant unit tests
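The first stage of a pipeline like the one above, filtering PRs by performance-related keywords before the heavier static and execution analysis, might look as follows. The keyword list and predicate are illustrative assumptions, not the paper's actual filter.

```python
import re

# Hypothetical performance-keyword pattern; the word list is an assumption.
PERF_KEYWORDS = re.compile(
    r"\b(speed ?up|perf(ormance)?|optimi[sz]e[sd]?|faster|benchmark)\b",
    re.IGNORECASE,
)

def looks_like_perf_pr(title: str, body: str) -> bool:
    """Return True if a pull request's title or body suggests a performance edit."""
    return bool(PERF_KEYWORDS.search(title) or PERF_KEYWORDS.search(body))
```

Candidates passing this cheap textual filter would then proceed to static analysis, coverage mapping, and execution-based validation of the claimed speedup.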