🤖 AI Summary
Existing software engineering agent benchmarks (e.g., SWE-Bench Verified) rely on GitHub issue descriptions, which diverge significantly from how developers actually interact with conversational coding assistants inside IDEs, leading to systematic overestimation of capabilities such as bug fixing.
Method: We propose the first benchmark mutation framework grounded in empirical developer interaction analysis, integrating telemetry data, query reformulation, and natural language transformation techniques to convert formal problem statements into dialogue-style queries that faithfully reflect authentic user intent.
Contribution/Results: Experiments across multiple public and private benchmarks reveal that conventional benchmarks can overestimate agent performance by more than 50% for some models on public benchmarks and by roughly 10-16% on a private benchmark. Our work not only identifies the root causes of this evaluation bias but also establishes a novel, interaction-aware evaluation paradigm for conversational programming agents, substantially improving assessment fidelity and validity.
📝 Abstract
Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are predominantly derived from GitHub issues and fail to accurately reflect how developers interact with chat-based coding assistants in integrated development environments (IDEs). We posit that this mismatch leads to a systematic overestimation of agents' capabilities in real-world scenarios, especially bug fixing. We introduce a novel benchmarking framework that transforms existing formal benchmarks into realistic user queries through systematic analysis of developer interaction patterns with chat-based agents. Our methodology is flexible and can be readily extended to other existing benchmarks. In this paper, we apply our testing framework to SWE-Bench Verified, the TypeScript subset of Multi-SWE-Bench, and a private benchmark, SWE-Bench C#, transforming formal GitHub issue descriptions into realistic user-style queries based on telemetry analysis of interactions with a popular chat-based agent. Our findings reveal that existing benchmarks significantly overestimate agent capabilities, for some models by more than 50% relative to baseline performance on the public benchmarks and by approximately 10-16% on our internal benchmark. This work establishes a new paradigm for evaluating interactive chat-based software engineering agents through benchmark mutation techniques.
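To make the transformation concrete, the sketch below shows one way a benchmark mutation of this kind could be implemented: rewriting a formal GitHub issue into a conversational, user-style query. It is illustrative only; the prompt, the `mutate_issue` helper, and the model choice are hypothetical and assume the OpenAI Python SDK, whereas the actual framework grounds its transformations in telemetry analysis of real developer-agent interactions rather than a single rewrite prompt.

```python
# Illustrative sketch only: NOT the authors' pipeline. The prompt, model
# choice, and helper names are hypothetical; the paper's framework derives
# its transformations from telemetry of developer-agent interactions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REWRITE_PROMPT = """\
Rewrite the following formal GitHub issue as a short, informal question that a
developer might type to a chat-based coding assistant inside an IDE. Drop
boilerplate (version tables, reproduction templates), keep only the details a
user would realistically mention, and use a conversational tone.

GitHub issue:
{issue}
"""


def mutate_issue(issue_text: str, model: str = "gpt-4o") -> str:
    """Return a dialogue-style query derived from a formal issue description."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": REWRITE_PROMPT.format(issue=issue_text)}
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()


if __name__ == "__main__":
    formal_issue = (
        "Title: TypeError in DataFrame.merge when on=None\n"
        "Steps to reproduce: ...\nExpected behavior: ...\nActual behavior: ..."
    )
    print(mutate_issue(formal_issue))
```

In a setup like this, the mutated queries would replace the original problem statements when prompting the agent under evaluation, so that the same gold tests measure performance under realistic, conversational inputs.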