Mokav: Execution-driven Differential Testing with LLMs

📅 2024-06-14
🏛️ Journal of Systems and Software
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work addresses the problem of detecting functional differences between two program versions by automatically generating difference-exposing tests (DETs). We propose an execution-feedback-driven iterative generation method that leverages large language models (LLMs), introducing a mechanism that dynamically incorporates runtime execution outcomes into the prompt to enable closed-loop test-case refinement. The approach integrates differential testing, Python program analysis, and dynamic feedback-augmented prompting. Evaluated on 1,535 program pairs from Codeforces, the method achieves an 81.7% DET generation rate, substantially outperforming Pynguin (4.9%) and Differential Prompting (37.3%). These results empirically validate the effectiveness and generalizability of execution-driven LLM prompting for detecting program behavior discrepancies.

📝 Abstract
It is essential to detect functional differences in various software engineering tasks, such as automated program repair, mutation testing, and code refactoring. The problem of detecting functional differences between two programs can be reduced to searching for a difference exposing test (DET): a test input that results in different outputs on the subject programs. In this paper, we propose Mokav, a novel execution-driven tool that leverages LLMs to generate DETs. Mokav takes two versions of a program (P and Q) and an example test input. When successful, Mokav generates a valid DET, a test input that leads to different outputs on P and Q. Mokav iteratively prompts an LLM with a specialized prompt to generate new test inputs. At each iteration, Mokav provides execution-based feedback regarding previously generated tests until the LLM produces a DET. We evaluate Mokav on 1,535 pairs of Python programs collected from the Codeforces competition platform and 32 pairs of programs from the QuixBugs dataset. Our experiments show that Mokav outperforms the state-of-the-art, Pynguin and Differential Prompting, by a large margin. Mokav can generate DETs for 81.7% (1,255/1,535) of the program pairs in our benchmark (versus 4.9% for Pynguin and 37.3% for Differential Prompting). We demonstrate that all components in our system, including the iterative and execution-driven approaches, contribute to its high effectiveness.
Problem

Research questions and friction points this paper is trying to address.

Detect functional differences between program versions
Generate difference-exposing test inputs using LLMs
Improve effectiveness over existing differential testing tools
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages LLMs to generate difference-exposing tests
Uses execution-driven feedback for iterative test refinement
Specialized prompts guide LLM to produce valid DETs
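The iterative, execution-driven loop described in the abstract can be sketched as follows. This is a minimal illustration, not Mokav's actual implementation: `propose_input` stands in for the LLM call, and the feedback string is a simplified version of the execution-based feedback the paper describes.

```python
import subprocess
import sys

def run_program(source: str, test_input: str) -> str:
    """Run a Python program on the given stdin and capture its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        input=test_input, capture_output=True, text=True, timeout=10,
    )
    return result.stdout.strip()

def find_det(p_source: str, q_source: str, seed_input: str,
             propose_input, max_iters: int = 10):
    """Iteratively ask a proposer (LLM stand-in) for test inputs until
    programs P and Q produce different outputs, i.e. a DET is found."""
    feedback = f"Seed test input: {seed_input!r}"
    for _ in range(max_iters):
        candidate = propose_input(feedback)
        out_p = run_program(p_source, candidate)
        out_q = run_program(q_source, candidate)
        if out_p != out_q:
            return candidate  # difference-exposing test found
        # Feed the execution outcome back into the next prompt.
        feedback = (f"Input {candidate!r} produced identical output "
                    f"{out_p!r} on both programs; propose a different input.")
    return None  # no DET found within the iteration budget
```

For example, with `P` doubling its input and `Q` doing the same except on `5`, a proposer that eventually suggests `"5"` lets `find_det` return that input as a DET.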