RM-RF: Reward Model for Run-Free Unit Test Evaluation

📅 2026-01-19
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work proposes the first lightweight reward model capable of predicting multidimensional test quality, including executability, code coverage, and mutation kill rate, without executing tests, thereby circumventing the high latency and resource overhead inherent in traditional unit test evaluation that relies on repeated compilation and execution. Leveraging a multilingual dataset constructed from Java, Python, and Go, the model is evaluated under several regimes: zero-shot inference, full fine-tuning, and parameter-efficient fine-tuning via LoRA. It achieves an average F1 score of 0.69 across all three quality metrics. By eliminating the need for actual test runs, this approach substantially reduces evaluation cost and latency, offering an efficient foundation for large-scale test generation and reinforcement learning–based test optimization.

📝 Abstract
We present RM-RF, a lightweight reward model for run-free evaluation of automatically generated unit tests. Instead of repeatedly compiling and executing candidate tests, RM-RF predicts, from source and test code alone, three execution-derived signals: (1) whether the augmented test suite compiles and runs successfully, (2) whether the generated test cases increase code coverage, and (3) whether the generated test cases improve the mutation kill rate. To train and evaluate RM-RF we assemble a multilingual dataset (Java, Python, Go) of focal files, test files, and candidate test additions labeled by an execution-based pipeline, and we release the dataset and methodology for comparative evaluation. We evaluate multiple model families and tuning regimes (zero-shot, full fine-tuning, and PEFT via LoRA), achieving an average F1 of 0.69 across the three targets. Compared to conventional compile-and-run instruments, RM-RF provides substantially lower latency and infrastructure cost while delivering competitive predictive fidelity, enabling fast, scalable feedback for large-scale test generation and RL-based code optimization.
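To make the evaluation setup concrete, the sketch below shows how the reported average F1 over the three binary targets (executability, coverage gain, mutation-kill gain) can be computed from a reward model's predictions. The toy predictions, labels, and names here are illustrative assumptions, not the paper's actual data or API.

```python
def f1(preds, golds):
    """Binary F1 for one quality signal (1 = positive)."""
    tp = sum(p and g for p, g in zip(preds, golds))
    fp = sum(p and not g for p, g in zip(preds, golds))
    fn = sum((not p) and g for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical reward-model outputs for four candidate tests, one list per
# target: does the suite run, does coverage increase, does kill rate improve.
predictions = {
    "executability": [1, 1, 0, 1],
    "coverage_gain": [1, 0, 0, 1],
    "mutation_gain": [0, 0, 1, 1],
}
# Hypothetical execution-based ground-truth labels for the same candidates.
labels = {
    "executability": [1, 1, 0, 0],
    "coverage_gain": [1, 0, 1, 1],
    "mutation_gain": [0, 1, 1, 1],
}

scores = {target: f1(predictions[target], labels[target]) for target in predictions}
avg_f1 = sum(scores.values()) / len(scores)
print(scores, round(avg_f1, 2))
```

The paper's headline number (average F1 of 0.69) is this kind of mean over the three per-target F1 scores, computed on the multilingual labeled dataset rather than on toy data.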
Problem

Research questions and friction points this paper is trying to address.

unit test evaluation
run-free
reward model
code coverage
mutation testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

reward model
run-free evaluation
unit test generation
code coverage prediction
mutation testing
Elena Bruches
Senior Lecturer, Novosibirsk State University
natural language processing
D. Grebenkin
Siberian Neuronets LLC, Novosibirsk, Russia
Mikhail Klementev
Siberian Neuronets LLC, Novosibirsk, Russia
Vadim Alperovich
T-Technologies, Moscow, Russia
Roman Derunets
Siberian Neuronets LLC, Novosibirsk, Russia
Dari Baturova
Siberian Neuronets LLC, Novosibirsk, Russia
Georgy Mkrtchyan
T-Technologies, Moscow, Russia
Oleg Sedukhin
Siberian Neuronets LLC, Novosibirsk, Russia
Ivan Bondarenko
Researcher, Laboratory of Applied Digital Technologies, Novosibirsk State University
Deep Learning, Natural Language Processing, Automatic Speech Recognition, Automated Machine Learning, Few-Shot Learning
Nikolay Bushkov
T-Technologies, Moscow, Russia
Stanislav Moiseev
T-Technologies
computer science, AI, mathematics