PerfBench: Can Agents Resolve Real-World Performance Bugs?

📅 2025-09-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing software engineering agent benchmarks emphasize functional correctness while neglecting non-functional performance defects, such as inefficient code that wastes resources without causing crashes. Method: We introduce PerfBench, the first benchmark dedicated to performance-defect repair, comprising 81 real-world tasks extracted from GitHub .NET projects. We design an end-to-end evaluation framework supporting automated execution, quantitative measurement of performance differences (e.g., latency, memory usage), and human validation. Our approach integrates performance test generation, output analysis tooling, and performance-aware instruction design to enable reproducible, verifiable assessment of AI agents' repair capabilities. Contribution/Results: Experiments show that state-of-the-art agents (e.g., OpenHands) achieve only a ~3% success rate on PerfBench; a purpose-built variant, OpenHands-Perf-Agent, raises this to ~20%. These results underscore a critical gap in current AI agents' ability to diagnose and repair performance issues and highlight substantial room for advancement.

📝 Abstract
Performance bugs are inefficiencies in software that waste computational resources without causing functional failures, making them particularly challenging to detect and fix. While recent advances in Software Engineering agents have shown promise in automated bug fixing, existing benchmarks primarily focus on functional correctness and fail to evaluate agents' abilities to identify and resolve non-functional issues like performance bugs. We introduce PerfBench, a benchmark comprising 81 real-world performance bug-fixing tasks from popular .NET repositories on GitHub. Unlike existing benchmarks that rely on pre-existing test suites, PerfBench features a novel evaluation harness that allows agents to generate their own performance benchmarks and validates fixes by comparing execution metrics collected for the developer fix and the agent fix. Each task in PerfBench is derived from an actual developer fix linked to a performance-related issue and verified by human experts, ensuring real-world relevance. Our evaluation reveals that current state-of-the-art coding agents struggle with performance optimization tasks, with the baseline OpenHands agent achieving only a ~3% success rate on our benchmark. We develop OpenHands-Perf-Agent, which incorporates performance-aware tooling and instructions and achieves a ~20% success rate. We show that by giving the agent proper instructions to benchmark its changes and tooling to process benchmark output, we can significantly improve agent performance, though substantial room for improvement remains. PerfBench provides a challenging test set for advancing agents' capabilities in fixing performance issues.
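The abstract's validation idea, accepting an agent fix only if its measured metrics compare favorably with the developer fix over the buggy baseline, can be sketched as below. This is a minimal illustrative sketch: the function name, metric, and 10% tolerance are assumptions, not PerfBench's actual criterion.

```python
# Hypothetical sketch of a PerfBench-style validation check.
# The 10% tolerance and millisecond latency metric are
# illustrative assumptions, not the benchmark's real rule.

def validate_fix(baseline_ms: float, developer_ms: float,
                 agent_ms: float, tolerance: float = 0.10) -> bool:
    """Accept the agent fix if it recovers most of the developer
    fix's improvement over the buggy baseline."""
    dev_improvement = baseline_ms - developer_ms
    if dev_improvement <= 0:
        return False  # developer fix shows no measurable gain
    agent_improvement = baseline_ms - agent_ms
    # Require at least (1 - tolerance) of the developer's gain.
    return agent_improvement >= (1 - tolerance) * dev_improvement

# Agent nearly matches the developer fix -> accepted.
print(validate_fix(1000.0, 400.0, 450.0))  # True
# Agent recovers only a third of the gain -> rejected.
print(validate_fix(1000.0, 400.0, 800.0))  # False
```

In practice the harness would gather such metrics from repeated benchmark runs (and for memory as well as latency) before applying any threshold, since single timing measurements are noisy.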
Problem

Research questions and friction points this paper is trying to address.

Evaluating agents' ability to resolve real-world performance bugs
Introducing a benchmark for non-functional software inefficiencies
Assessing automated bug fixing for computational resource optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

PerfBench benchmark with 81 real-world performance bug tasks
Novel evaluation harness allowing agents to generate benchmarks
Performance-aware tooling and instructions improving agent success rate