🤖 AI Summary
This work proposes the first lightweight reward model capable of predicting multidimensional test quality—including executability, code coverage, and mutation kill rate—without executing tests, thereby circumventing the high latency and resource overhead inherent in traditional unit test evaluation that relies on repeated compilation and execution. Leveraging a multilingual dataset constructed from Java, Python, and Go, the model is evaluated under several regimes: zero-shot inference, full fine-tuning, and parameter-efficient fine-tuning via LoRA. It achieves an average F1 score of 0.69 across the three quality metrics. By eliminating the need for actual test runs, this approach substantially reduces evaluation cost and latency, offering an efficient foundation for large-scale test generation and reinforcement learning–based test optimization.
📝 Abstract
We present RM-RF, a lightweight reward model for run-free evaluation of automatically generated unit tests. Instead of repeatedly compiling and executing candidate tests, RM-RF predicts, from source and test code alone, three execution-derived signals: (1) whether the augmented test suite compiles and runs successfully, (2) whether the generated test cases increase code coverage, and (3) whether the generated test cases improve the mutation kill rate. To train and evaluate RM-RF, we assemble a multilingual dataset (Java, Python, Go) of focal files, test files, and candidate test additions labeled by an execution-based pipeline, and we release the dataset and methodology for comparative evaluation. We evaluate multiple model families and tuning regimes (zero-shot inference, full fine-tuning, and parameter-efficient fine-tuning via LoRA), achieving an average F1 of 0.69 across the three targets. Compared to conventional compile-and-run evaluation, RM-RF offers substantially lower latency and infrastructure cost while delivering competitive predictive fidelity, enabling fast, scalable feedback for large-scale test generation and RL-based code optimization.
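The evaluation protocol described above, three binary targets scored by F1 and then averaged, can be sketched in plain Python. The target names and the example labels below are illustrative, not taken from the paper's dataset; the reward model itself (a fine-tuned LLM) is outside the scope of this sketch, which only shows how predicted triples would be compared against execution-derived gold labels.

```python
# Hypothetical sketch of RM-RF's scoring protocol: each candidate test
# addition carries three execution-derived binary labels, and a run-free
# reward model predicts the same triple. Quality is the F1 score per
# target, averaged over the three targets.

def f1(gold, pred):
    """Binary F1 for one target over parallel lists of booleans."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative target names (the paper's labels: suite compiles/runs,
# coverage increases, mutation kill rate improves).
TARGETS = ("runs_ok", "coverage_up", "mutation_kill_up")

def average_f1(gold_labels, pred_labels):
    """Mean F1 across the three targets.

    gold_labels / pred_labels: lists of dicts mapping target -> bool.
    Returns (mean_f1, per_target_scores).
    """
    scores = {
        t: f1([x[t] for x in gold_labels], [x[t] for x in pred_labels])
        for t in TARGETS
    }
    return sum(scores.values()) / len(TARGETS), scores

# Toy example with three labeled candidate test additions.
gold = [
    {"runs_ok": True,  "coverage_up": True,  "mutation_kill_up": False},
    {"runs_ok": True,  "coverage_up": False, "mutation_kill_up": False},
    {"runs_ok": False, "coverage_up": True,  "mutation_kill_up": True},
]
pred = [
    {"runs_ok": True,  "coverage_up": True,  "mutation_kill_up": False},
    {"runs_ok": False, "coverage_up": False, "mutation_kill_up": True},
    {"runs_ok": False, "coverage_up": True,  "mutation_kill_up": True},
]

mean_f1, per_target = average_f1(gold, pred)
```

Averaging per-target F1 (rather than pooling all predictions) keeps each quality signal equally weighted, which matters because a suite that merely compiles is far more common than one that improves the mutation kill rate.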