Saving SWE-Bench: A Benchmark Mutation Approach for Realistic Agent Evaluation

📅 2025-10-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing software engineering agent benchmarks (e.g., SWE-Bench Verified) rely on GitHub issue descriptions, which diverge significantly from how developers actually interact with conversational coding assistants inside IDEs — leading to systematic overestimation of capabilities such as bug fixing. Method: We propose the first benchmark mutation framework grounded in empirical developer interaction analysis, integrating telemetry data, query reformulation, and natural language transformation techniques to convert formal problem statements into dialogue-style queries that faithfully reflect authentic user intent. Contribution/Results: Experiments across multiple public and private benchmarks reveal that conventional benchmarks overestimate agent performance by over 50% for some models on public benchmarks. Our work not only identifies the root causes of evaluation bias but also establishes a novel, interaction-aware evaluation paradigm for conversational programming agents — substantially enhancing assessment fidelity and validity.

📝 Abstract
Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are predominantly derived from GitHub issues and fail to accurately reflect how developers interact with chat-based coding assistants in integrated development environments (IDEs). We posit that this mismatch leads to a systematic overestimation of agents' capabilities in real-world scenarios, especially bug fixing. We introduce a novel benchmarking framework that transforms existing formal benchmarks into realistic user queries through systematic analysis of developer interaction patterns with chat-based agents. Our methodology is flexible and can be easily extended to existing benchmarks. In this paper, we apply our testing framework to SWE-Bench Verified, the TypeScript subset of Multi-SWE-Bench, and a private benchmark, SWE-Bench C#, transforming formal GitHub issue descriptions into realistic user-style queries based on telemetry analysis of interactions with a popular chat-based agent. Our findings reveal that existing benchmarks significantly overestimate agent capabilities: for some models by >50% over baseline performance on the public benchmarks, and by ~10-16% on our internal benchmark. This work establishes a new paradigm for evaluating interactive chat-based software engineering agents through benchmark mutation techniques.
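The abstract describes mutating formal GitHub issue text into chat-style queries. The paper's actual pipeline is telemetry-driven and uses natural language transformation; as a rough illustration of the benchmark-mutation idea only, here is a deterministic toy sketch (all class and function names are hypothetical, not the paper's schema): it strips structure real users rarely type in chat (section headers, repro code blocks), keeps the lead sentences, and wraps them in conversational framing.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BenchmarkTask:
    """One benchmark instance (illustrative fields, not the paper's format)."""
    instance_id: str
    problem_statement: str  # formal GitHub issue text


def mutate_to_user_query(task: BenchmarkTask) -> BenchmarkTask:
    """Rewrite a formal issue into a terse, chat-style query.

    A deterministic stand-in for telemetry-informed rewriting: real chat
    queries tend to be short and omit the scaffolding of issue templates.
    """
    text = task.problem_statement
    # Real user queries rarely paste full repro scripts or tracebacks.
    text = re.sub(r"```.*?```", "", text, flags=re.S)
    # Drop markdown section headers such as "## Steps to reproduce".
    text = re.sub(r"(?m)^#{1,6}\s.*$", "", text)
    # Keep only the first two sentences, mimicking terse chat queries.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    core = " ".join(s for s in sentences[:2] if s).strip()
    query = f"Hey, {core[:1].lower()}{core[1:]} Can you fix this?"
    return replace(task, problem_statement=query)
```

A mutated task keeps its identity (so pass/fail can be compared against the unmutated baseline) while only the query text changes, which is what makes the "overestimation" measurement in the abstract possible.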
Problem

Research questions and friction points this paper is trying to address.

Transforms GitHub issues into realistic developer queries
Addresses overestimation of agent capabilities in real scenarios
Creates new evaluation paradigm for chat-based coding assistants
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transforms formal benchmarks into realistic user queries
Mutates GitHub issues using developer interaction patterns
Applies telemetry analysis to create chat-style evaluations
Spandan Garg
Microsoft
Agentic AI · Agents for Code · AI for Software Engineering · Artificial Intelligence · Machine Learning
Ben Steenhoek
Microsoft Corporation
Yufan Huang
Microsoft Corporation