🤖 AI Summary
This study investigates the capability of large language models (LLMs) to autonomously generate regression tests for structured-input programs, such as XML parsers and JavaScript interpreters, without prior knowledge of the input grammar, supporting both bug detection and patch validation.
Method: We propose Cleverest, the first zero-shot, feedback-driven LLM-based test generation framework, integrating code-diff-aware semantic understanding, multi-round interactive refinement, and structured-input modeling.
Contribution/Results: Evaluated on 22 real-world commits across MuJS, Libxml2, and Poppler, Cleverest generates defect-revealing tests for XML and JavaScript inputs with a high success rate, in under three minutes on average. For PDF, where the format's structural complexity limits direct effectiveness, the generated inputs still serve as high-quality seeds for grey-box fuzzing. This work provides the first systematic empirical evidence that zero-shot LLMs can produce human-understandable, reproducible, and semantically meaningful test cases for real-world structured-input programs.
📝 Abstract
Large Language Models (LLMs) have shown tremendous promise in automated software engineering. In this paper, we investigate the opportunities of LLMs for automatic regression test generation for programs that take highly structured, human-readable inputs, such as XML parsers or JavaScript interpreters. Concretely, we explore the following regression test generation scenarios for such programs, which have so far been difficult to test automatically in the absence of corresponding input grammars:

• Bug finding. Given a code change (e.g., a commit or pull request), our LLM-based approach generates a test case with the objective of revealing any bugs that might be introduced if that change is applied.

• Patch testing. Given a patch, our LLM-based approach generates a test case that fails before but passes after the patch. This test can be added to the regression test suite to catch similar bugs in the future.

We implement Cleverest, a feedback-directed, zero-shot LLM-based regression test generation technique, and evaluate its effectiveness on 22 commits to three subject programs: MuJS, Libxml2, and Poppler. For programs using more human-readable file formats, like XML or JavaScript, Cleverest performed very well. It generated easy-to-understand bug-revealing or bug-reproducing test cases for the majority of commits in just under three minutes, even when given only the code diff or the commit message (unless the message was too vague). For programs with more compact file formats, like PDF, it struggled, as expected, to generate effective test cases. However, the LLM-supplied test cases are not far from becoming effective (e.g., when used as a seed by a greybox fuzzer or as a starting point by the developer).
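The patch-testing criterion described above (a candidate test must fail on the pre-patch program and pass on the patched one), combined with feedback-directed refinement, can be sketched as follows. This is a minimal illustration only: the function names, the loop structure, and the textual feedback format are assumptions for exposition, not Cleverest's actual implementation.

```python
def is_regression_test(passes_before: bool, passes_after: bool) -> bool:
    """Patch-testing criterion from the abstract: keep a candidate test
    only if it fails before the patch and passes after it."""
    return (not passes_before) and passes_after


def generate_regression_test(ask_llm, run_pre_patch, run_post_patch,
                             max_rounds: int = 5):
    """Hypothetical feedback-directed loop (illustrative, not Cleverest's
    real prompt protocol): request a candidate input from the LLM, execute
    it against both program versions, and feed the observed outcomes back
    into the next request until the criterion is met."""
    feedback = ""  # empty on the first, zero-shot round
    for _ in range(max_rounds):
        candidate = ask_llm(feedback)
        passes_before = run_pre_patch(candidate)
        passes_after = run_post_patch(candidate)
        if is_regression_test(passes_before, passes_after):
            return candidate
        # Summarize the failed attempt for the next round of refinement.
        feedback = (f"Candidate passed pre-patch: {passes_before}, "
                    f"passed post-patch: {passes_after}. Try again.")
    return None  # no valid regression test found within the budget
```

In practice the two runner callables would wrap the pre- and post-patch binaries (e.g., via subprocess calls), and the candidate would be a structured input file such as an XML document or a JavaScript program.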