Historian: Reducing Manual Validation in APR Benchmarking via Evidence-Based Assessment

📅 2026-02-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses a long-standing challenge in automated program repair (APR): evaluating patch correctness, which has traditionally relied on time-consuming, subjective manual validation, while existing automated approaches struggle with semantic redundancy and the existence of multiple valid solutions. To overcome these limitations, the authors propose Historian, a framework that introduces a multi-reference semantic comparison mechanism grounded in a knowledge base of historically validated patches. Leveraging large language models, Historian produces traceable, evidence-driven assessments and conservatively labels genuinely novel patches as “unknown” for human review. Experimental results show that Historian achieves 95.0% coverage and 88.4% accuracy under leave-one-out evaluation, reducing manual verification to just 5% of patches. Deployed as a pre-filter, it improves the accuracy of existing patch-correctness assessment tools by up to 21.8%, and a hybrid pipeline attains 86.2% overall accuracy with 100% coverage.

📝 Abstract
Assessing the correctness of patches generated by Automated Program Repair (APR) is a major bottleneck. Manual validation is labor-intensive and limited: exact matching overlooks valid variants, while semantic inspection is subjective and hard to reproduce. Existing Automated Patch Correctness Assessment (APCA) approaches often rely on opaque predictive models that treat each patch as novel, repeatedly re-assessing semantically redundant patches. Our analysis of a large corpus of tool-generated patches reveals a duality: about 39% of unique correct patches are syntactic clones, suggesting opportunities for automation, yet about 65% of bugs have multiple distinct correct fixes, making single-reference assessment insufficient. We present Historian, a framework that leverages Large Language Models to perform multi-reference comparisons against a knowledge base of historically validated patches, producing traceable, evidence-based verdicts while conservatively isolating novel cases as Unknown. In leave-one-tool-out evaluation, Historian achieves 95.0% coverage with 88.4% accuracy, reducing manual validation to 5% of patches. As an evidence-based pre-filter, Historian enhances the accuracy of standalone APCA tools by up to 21.8% and enables a hybrid pipeline with 86.2% overall accuracy and 100% coverage. A longitudinal analysis of tool-generated patches (2020–2024) shows that redundancy in repair attempts is common, indicating that many patches repeatedly rediscover established ones and strengthening the sustainability of evidence-based APR assessment.
Problem

Research questions and friction points this paper is trying to address.

Automated Program Repair
Patch Correctness Assessment
Manual Validation
Semantic Redundancy
Evidence-Based Assessment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated Program Repair
Patch Correctness Assessment
Large Language Models
Evidence-Based Validation
Historical Patch Knowledge Base
Sahand Moslemi
Bilkent University, Turkey
Mayasah Lami
Bilkent University, Turkey
Anil Koyuncu
SnT, University of Luxembourg