🤖 AI Summary
Existing LLM-based automated test generation primarily produces static input-output assertion pairs, resulting in limited test diversity and insufficient debugging information. This work proposes a novel paradigm for generating executable test harnesses—supporting dynamic input construction and flexible output validation (e.g., invariant checking). Methodologically, we design a two-stage training framework: first, supervised fine-tuning (SFT) to teach LLMs the structural conventions of test scripts; second, reinforcement learning with a custom reward function (RLVR) to optimize test quality along dimensions such as correctness, coverage, and verifiability. Empirical evaluation demonstrates substantial improvements in defect detection rate and test strategy diversity; moreover, the generated harnesses support runtime extension to further enhance code generation fidelity. Our core contribution is the first systematic advancement of LLM-driven test generation—from static assertion pairs to fully executable, formally verifiable, and extensible test programs.
📝 Abstract
Existing LLM-based automatic test generation methods mainly produce pairs of inputs and expected outputs to characterize the intended behavior of correct programs. Although straightforward, these methods yield limited diversity in the generated tests and cannot provide enough debugging information. We propose HarnessLLM, a two-stage training pipeline that enables LLMs to write harness code for testing. Specifically, LLMs generate code that synthesizes inputs and validates the observed outputs, allowing complex test cases and flexible output validation such as invariant checking. To achieve this, we train LLMs with supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR) using a customized reward design. Experiments show that HarnessLLM outperforms input-output-based testing in bug finding and in the diversity of testing strategies. HarnessLLM further improves code generation performance through test-time scaling, using the generated test cases for inference-phase validation. Our code is available at https://github.com/UCSB-NLP-Chang/HarnessLLM.git.
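To make the contrast with static input-output pairs concrete, here is a minimal, hypothetical sketch of what such a harness might look like (the function names and structure are illustrative assumptions, not the paper's actual generated code): instead of asserting fixed expected outputs, the harness synthesizes random inputs and validates outputs via invariants.

```python
import random

def harness(sort_fn, trials=100):
    """Hypothetical test harness for a sorting routine: synthesize inputs,
    then validate observed outputs against invariants rather than against
    fixed expected values."""
    for _ in range(trials):
        # Dynamic input construction: random list of random length.
        xs = [random.randint(-1000, 1000) for _ in range(random.randint(0, 50))]
        out = sort_fn(list(xs))
        # Invariant 1: output is non-decreasing.
        assert all(a <= b for a, b in zip(out, out[1:])), f"not sorted on input: {xs}"
        # Invariant 2: output is a permutation of the input.
        assert sorted(xs) == sorted(out), f"elements changed on input: {xs}"
    return True
```

A correct implementation (e.g., the built-in `sorted`) passes all trials, while a buggy one is flagged along with the concrete failing input, which is the kind of debugging information a static assertion pair cannot provide.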