Execution-Feedback Driven Test Generation from SWE Issues

📅 2025-08-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

When the code under test is missing or wrong, as it is for any open SWE (software engineering) issue, generating a reproduction test is hard because there is no correct implementation to execute. This paper proposes an execution-feedback-driven test generation method that works around that limitation, implemented in a new reproduction test generator called e-Otter++. Evaluated on the TDD-Bench Verified benchmark, it generates tests with an average fail-to-pass rate of 63%, which the authors describe as a leap ahead of the state of the art. The core contribution is a set of techniques for using execution feedback in settings where the code is erroneous or missing.

📝 Abstract

A software engineering issue (SWE issue) is easier to resolve when accompanied by a reproduction test. Unfortunately, most issues do not come with functioning reproduction tests, so this paper explores how to generate them automatically. The primary challenge in this setting is that the code to be tested is either missing or wrong, as evidenced by the existence of the issue in the first place. This has held back test generation for this setting: without the correct code to execute, it is difficult to leverage execution feedback to generate good tests. This paper introduces novel techniques for leveraging execution feedback to get around this problem, implemented in a new reproduction test generator called e-Otter++. Experiments show that e-Otter++ represents a leap ahead in the state-of-the-art for this problem, generating tests with an average fail-to-pass rate of 63% on the TDD-Bench Verified benchmark.
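The fail-to-pass rate reported above can be illustrated with a small sketch. It assumes the usual meaning of the term: a generated test counts as fail-to-pass when it fails on the original (buggy) code and passes once the issue's fix is applied. The `Outcome` class, the function names, and the sample data are hypothetical and are not taken from e-Otter++ or TDD-Bench Verified.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Hypothetical record of one generated test run on both code versions."""
    passed_before_fix: bool  # result on the original code that exhibits the issue
    passed_after_fix: bool   # result after the issue's fix has been applied


def is_fail_to_pass(outcome: Outcome) -> bool:
    # A reproduction test should fail before the fix and pass after it.
    return (not outcome.passed_before_fix) and outcome.passed_after_fix


def fail_to_pass_rate(outcomes: list[Outcome]) -> float:
    # Fraction of generated tests that are fail-to-pass; 0.0 for an empty list.
    if not outcomes:
        return 0.0
    return sum(is_fail_to_pass(o) for o in outcomes) / len(outcomes)


# Illustrative data only: four generated tests, two of which reproduce their issue.
outcomes = [
    Outcome(passed_before_fix=False, passed_after_fix=True),   # fail-to-pass
    Outcome(passed_before_fix=True, passed_after_fix=True),    # never failed
    Outcome(passed_before_fix=False, passed_after_fix=False),  # still failing
    Outcome(passed_before_fix=False, passed_after_fix=True),   # fail-to-pass
]
print(fail_to_pass_rate(outcomes))
```

A test that passes on both versions does not exercise the issue, and a test that fails on both is likely broken for an unrelated reason, so neither counts toward the rate in this sketch.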

Problem

Research questions and friction points this paper is trying to address.

Generating reproduction tests for SWE issues automatically

Overcoming missing or incorrect code in test generation

Leveraging execution feedback without correct code

Innovation

Methods, ideas, or system contributions that make the work stand out.

Execution-feedback driven test generation

Novel techniques for leveraging execution feedback when the code under test is missing or wrong

e-Otter++ test generator implementation

🔎 Similar Papers

TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark