🤖 AI Summary
Existing software engineering agent benchmarks (e.g., SWE-Bench Verified) rely on GitHub issue descriptions, which diverge significantly from how developers actually interact with conversational coding assistants inside IDEs, leading to systematic overestimation of capabilities such as bug fixing.
Method: We propose the first benchmark mutation framework grounded in empirical developer interaction analysis, integrating telemetry data, query reformulation, and natural language transformation techniques to convert formal problem statements into dialogue-style queries that faithfully reflect authentic user intent.
Contribution/Results: Experiments across multiple public and private benchmarks reveal that conventional benchmarks can overestimate agent performance by more than 50% for some models on public benchmarks and by roughly 10-16% on a private benchmark. Our work not only identifies the root causes of this evaluation bias but also establishes a novel, interaction-aware evaluation paradigm for conversational programming agents, substantially improving assessment fidelity and validity.
📝 Abstract
Current benchmarks for evaluating software engineering agents, such as SWE-Bench Verified, are predominantly derived from GitHub issues and fail to accurately reflect how developers interact with chat-based coding assistants in integrated development environments (IDEs). We posit that this mismatch leads to a systematic overestimation of agents' capabilities in real-world scenarios, especially bug fixing. We introduce a novel benchmarking framework that transforms existing formal benchmarks into realistic user queries through systematic analysis of developer interaction patterns with chat-based agents. Our methodology is flexible and can be readily extended to other existing benchmarks. In this paper, we apply our testing framework to SWE-Bench Verified, the TypeScript subset of Multi-SWE-Bench, and a private benchmark, SWE-Bench C#, transforming formal GitHub issue descriptions into realistic user-style queries based on telemetry analysis of interactions with a popular chat-based agent. Our findings reveal that existing benchmarks significantly overestimate agent capabilities, for some models by more than 50% relative to baseline performance on the public benchmarks and by approximately 10-16% on our internal benchmark. This work establishes a new paradigm for evaluating interactive chat-based software engineering agents through benchmark mutation techniques.
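To make the transformation concrete, the sketch below shows one way a benchmark mutation of this kind could be implemented: rewriting a formal GitHub issue into a conversational, user-style query. It is illustrative only; the prompt, the `mutate_issue` helper, and the model choice are hypothetical and assume the OpenAI Python SDK, whereas the actual framework grounds its transformations in telemetry analysis of real developer-agent interactions rather than a single rewrite prompt.

```python
# Illustrative sketch only: NOT the authors' pipeline. The prompt, model
# choice, and helper names are hypothetical; the paper's framework derives
# its transformations from telemetry of developer-agent interactions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

REWRITE_PROMPT = """\
Rewrite the following formal GitHub issue as a short, informal question that a
developer might type to a chat-based coding assistant inside an IDE. Drop
boilerplate (version tables, reproduction templates), keep only the details a
user would realistically mention, and use a conversational tone.

GitHub issue:
{issue}
"""


def mutate_issue(issue_text: str, model: str = "gpt-4o") -> str:
    """Return a dialogue-style query derived from a formal issue description."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "user", "content": REWRITE_PROMPT.format(issue=issue_text)}
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()


if __name__ == "__main__":
    formal_issue = (
        "Title: TypeError in DataFrame.merge when on=None\n"
        "Steps to reproduce: ...\nExpected behavior: ...\nActual behavior: ..."
    )
    print(mutate_issue(formal_issue))
```

In a setup like this, the mutated queries would replace the original problem statements when prompting the agent under evaluation, so that the same gold tests measure performance under realistic, conversational inputs.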