🤖 AI Summary
While large language models (LLMs) excel at code generation and repair, their ability to perform systematic, repository-level performance optimization remains largely unassessed. Method: We introduce SWE-Perf, the first benchmark for repository-level performance optimization, comprising 140 real-world, performance-improving pull requests from GitHub, each including the full repository, target functions, reproducible performance tests, and expert-authored patches. We systematically evaluate mainstream LLMs using file-level and repository-level agents (e.g., Agentless, OpenHands) within a unified, executable environment. Contribution/Results: Experiments reveal a substantial gap between current LLMs and human experts, exposing critical deficiencies in modeling complex cross-file dependencies and in understanding performance-sensitive code semantics. This work establishes a new research direction for intelligent code performance optimization and provides a reproducible, extensible evaluation infrastructure to support the principled development of LLM-based performance engineering tools.
📝 Abstract
Code performance optimization is paramount in real-world software engineering and critical for production-level systems. While Large Language Models (LLMs) have demonstrated impressive capabilities in code generation and bug fixing, their proficiency in enhancing code performance at the repository level remains largely unexplored. To address this gap, we introduce SWE-Perf, the first benchmark specifically designed to systematically evaluate LLMs on code performance optimization tasks within authentic repository contexts. SWE-Perf comprises 140 carefully curated instances, each derived from a performance-improving pull request to a popular GitHub repository. Each benchmark instance includes the relevant codebase, target functions, performance-related tests, expert-authored patches, and an executable environment. Through a comprehensive evaluation of representative file-level and repository-level methods (e.g., Agentless and OpenHands), we reveal a substantial capability gap between existing LLMs and expert-level optimization, highlighting critical research opportunities in this emerging field.
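To make the task concrete, below is a minimal, hypothetical sketch of how an instance of this kind could be scored: run the performance-related tests on the unmodified repository, apply a candidate patch, rerun the tests, and report the resulting speedup. The `Instance` schema and the `run_perf_test` and `evaluate` helpers are illustrative assumptions, not SWE-Perf's actual harness, schema, or metric.

```python
# Illustrative sketch only; all names and the scoring scheme are assumptions,
# not SWE-Perf's published evaluation protocol.
import subprocess
import time
from dataclasses import dataclass


@dataclass
class Instance:
    repo_dir: str        # local checkout of the target repository
    test_cmd: list[str]  # command that runs the performance-related tests
    patch_file: str      # candidate patch (model-generated or expert-authored)


def run_perf_test(repo_dir: str, test_cmd: list[str]) -> float:
    """Run the performance tests once; return wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(test_cmd, cwd=repo_dir, check=True)
    return time.perf_counter() - start


def evaluate(inst: Instance) -> float:
    """Return the speedup of the patched code over the baseline (>1 is faster)."""
    baseline = run_perf_test(inst.repo_dir, inst.test_cmd)
    subprocess.run(["git", "apply", inst.patch_file],
                   cwd=inst.repo_dir, check=True)
    patched = run_perf_test(inst.repo_dir, inst.test_cmd)
    return baseline / patched
```

In practice, a robust harness would repeat each timing several times and aggregate (e.g., take the minimum or median) to reduce measurement noise, and would run each instance in an isolated, reproducible environment such as a container.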