Towards a Human-in-the-Loop Framework for Reliable Patch Evaluation Using an LLM-as-a-Judge

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automated Program Repair (APR) commonly relies on execution-based metrics, such as unit test pass rates, to assess patch correctness; these metrics often fail to capture semantic correctness, while manual validation, though reliable, is prohibitively expensive. Method: the paper proposes a human-in-the-loop LLM evaluation framework: an LLM first generates defect-specific scoring rules for each bug; these rules undergo a one-time human review and are consolidated into a shared, reusable rule set; the LLM then automatically classifies patch validity against this fixed rule set. Contribution/Results: the "shared scoring rules" mechanism improves interpretability while drastically reducing human effort. Experiments show strong agreement with human consensus on patches with unanimous human labels (Cohen's kappa = 0.75, precision = 0.80, recall = 0.94); on the full dataset, including patches where raters disagree, kappa drops to 0.57, indicating reasonable generalizability with room for improvement.

📝 Abstract
Reliable evaluation is crucial for advancing Automated Program Repair (APR), but prevailing benchmarks rely on execution-based evaluation methods (unit test pass@k), which fail to capture true patch validity. Determining validity can require costly manual annotation. To reduce this cost, we introduce a human-in-the-loop approach to LLM-based patch validity judgment. Inspired by the observation that human judgment is better aligned when using a shared rubric, we first employ an LLM to generate a per-bug rubric, followed by a one-time human review and optional refinement to this rubric, and then employ an LLM to judge patches using the refined rubric. We apply this approach to assign binary validity labels to patches for issues found by Google sanitizer tools. Our results show that this approach yields substantial agreement with human consensus (Cohen's kappa 0.75), high recall (0.94) and high precision (0.80), when considering patches that have unanimous agreement from 3 human raters on the validity labels. On the full dataset including patches where human raters disagree, we find this approach can still be further improved (Cohen's kappa 0.57, recall 0.93, precision 0.65) and identify possible future directions.
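The reported agreement statistic, Cohen's kappa, corrects raw agreement for chance. A minimal worked illustration (toy labels chosen for the example, not the paper's data):

```python
# Cohen's kappa for two binary raters: kappa = (p_o - p_e) / (1 - p_e),
# where p_o is observed agreement and p_e is expected chance agreement.
# The label lists below are illustrative, not the paper's data.

def cohens_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    p_e = sum(  # chance agreement from each rater's marginal label rates
        (a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b)
    )
    return (p_o - p_e) / (1 - p_e)

llm_judge   = [1, 1, 1, 0, 0, 1, 0, 1]  # 1 = patch judged valid
human_label = [1, 1, 0, 0, 0, 1, 0, 1]

print(cohens_kappa(llm_judge, human_label))  # → 0.75
```

Here the raters agree on 7 of 8 patches (p_o = 0.875) but would agree half the time by chance (p_e = 0.5), giving kappa = 0.75.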
Problem

Research questions and friction points this paper is trying to address.

Automated Program Repair evaluation lacks reliable patch validity assessment
Manual patch validation requires costly human annotation efforts
Current benchmarks fail to capture true patch correctness accurately
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-in-the-loop framework for LLM-based patch evaluation
LLM generates per-bug rubric with human refinement
Uses refined rubric for automated patch validity judgment
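The three stages listed above could be sketched as follows. This is a hedged sketch, not the paper's implementation: `llm` is a stub standing in for a real model API, and the prompts, function names, and rubric text are illustrative assumptions.

```python
# Sketch of a three-stage human-in-the-loop patch-evaluation pipeline.
# llm() is a stub for a real model API call; all names are illustrative.

def llm(prompt: str) -> str:
    """Placeholder for a real LLM API call."""
    if "Valid?" in prompt:
        return "yes"  # a real judge would reason over the rubric here
    return ("The patch must eliminate the sanitizer-reported error.\n"
            "The patch must not disable or skip the failing check.")

def generate_rubric(bug_report: str) -> list[str]:
    """Stage 1: LLM drafts defect-specific scoring rules for one bug."""
    draft = llm(f"Write scoring rules for this bug:\n{bug_report}")
    return [line.strip() for line in draft.splitlines() if line.strip()]

def human_review(rubric: list[str]) -> list[str]:
    """Stage 2: one-time human review; here the draft is accepted as-is."""
    return rubric

def judge_patch(patch: str, rubric: list[str]) -> bool:
    """Stage 3: LLM classifies patch validity against the fixed rubric."""
    rules = "\n".join(rubric)
    verdict = llm(f"Rules:\n{rules}\n\nPatch:\n{patch}\n\nValid? (yes/no)")
    return verdict.strip().lower().startswith("yes")

rubric = human_review(generate_rubric("heap-buffer-overflow in parse()"))
print(judge_patch("- n + 1\n+ n", rubric))  # → True
```

The key design point is that human effort is spent once, on reviewing the rubric, rather than on every patch; the per-patch judgment in stage 3 is then fully automated against that fixed rule set.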