🤖 AI Summary
This work addresses the lack of rigorous evaluation of SWE-Agents on runtime performance optimization. We introduce GSO, the first benchmark designed specifically for real-world software performance optimization: it comprises 102 practical optimization tasks spanning multiple languages (C, Rust, Python, etc.) and domains, with expert developers' optimizations serving as the ground truth. Methodologically, we automatically extract challenging performance-bottleneck tasks from open-source commit histories, propose a quantitative evaluation paradigm grounded in precise performance testing, and build an automated testing pipeline with multi-language baseline comparisons. Experiments show that state-of-the-art SWE-Agents achieve less than a 5% success rate, with diminishing returns from inference-time scaling, exposing fundamental deficiencies in low-level semantic understanding, bottleneck localization, and optimization strategy generation. All data, scripts, and failure trajectories are publicly released to advance reproducible research on agent capabilities.
📝 Abstract
Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software. We develop an automated pipeline that analyzes repository commit histories and generates and executes performance tests to identify 102 challenging optimization tasks across 10 codebases, spanning diverse domains and programming languages. An agent is provided with a codebase and a performance test as a precise specification, and is tasked with improving runtime efficiency, which is measured against the expert developer's optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving less than a 5% success rate, with limited improvements even with inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulties with low-level languages, a tendency toward lazy optimization strategies, and challenges in accurately localizing bottlenecks. We release the code and artifacts of our benchmark, along with agent trajectories, to enable future research.
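The success criterion described above, measuring an agent's runtime improvement against the expert developer's optimization, can be sketched as follows. This is a minimal illustration, not the benchmark's exact protocol: the best-of-N timing, the helper names (`measure_runtime`, `is_successful`), and the 95% threshold are all assumptions for the example.

```python
import time


def measure_runtime(fn, repeats=5):
    """Return the best-of-N wall-clock runtime of fn() in seconds.

    Taking the minimum over repeats reduces noise from OS scheduling
    and caching effects (an assumed, simplified timing strategy).
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def is_successful(baseline_s, agent_s, expert_s, threshold=0.95):
    """Judge an agent patch 'successful' if it captures at least
    `threshold` of the expert's speedup over the unoptimized baseline.
    The threshold value is a hypothetical choice for illustration.
    """
    agent_speedup = baseline_s / agent_s
    expert_speedup = baseline_s / expert_s
    return agent_speedup >= threshold * expert_speedup


# Example: matching the expert counts as success; a marginal
# improvement over the baseline does not.
print(is_successful(10.0, 2.0, 2.0))  # agent matches expert -> True
print(is_successful(10.0, 9.0, 2.0))  # far from expert speedup -> False
```

In this framing, merely beating the unoptimized baseline is not enough: the agent's speedup is normalized against the expert's, which is what makes the tasks hard to "pass" with superficial patches.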