Summary-Mediated Repair: Can LLMs use code summarisation as a tool for program repair?

📅 2025-11-24
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Large language models (LLMs) frequently generate code containing subtle, implementation-level bugs that evade detection in standard benchmark evaluations and are difficult for the models to self-diagnose. Method: We propose a summary-driven program repair approach that exploits LLMs' tendency, when generating natural-language code summaries, to abstract away low-level noise and focus on high-level intent. We introduce *error-aware diagnostic summaries*, a novel summary style explicitly designed to highlight latent bugs and yield interpretable repair cues, and integrate them as an explicit intermediate step for error localisation and correction. Using prompt engineering alone, we implement a two-stage repair pipeline that fuses multiple summary styles. Contribution/Results: Evaluated across eight LLMs on HumanEvalPack and MBPP, our method achieves an average 5% absolute improvement in pass@1 repair rate over direct-repair baselines, successfully fixing 65% of previously undetected bugs. These results demonstrate that code summarisation serves as an effective, lightweight, and easily integrable diagnostic tool for LLM-based program repair.
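The two-stage pipeline described above can be sketched as follows. This is a minimal illustration only: the prompt wordings, the `generate` callable, and the function names are assumptions for exposition, not the paper's actual prompts or code.

```python
from typing import Callable

# Hypothetical prompt templates; the paper's exact prompts are not given.
SUMMARY_PROMPT = (
    "Review the following function for latent bugs. Write an "
    "error-aware diagnostic summary: describe its intended behaviour, "
    "then point out any place where the implementation may deviate "
    "from that intent.\n\nCode:\n{code}"
)

REPAIR_PROMPT = (
    "The diagnostic summary below was produced for the function that "
    "follows. Use it to localise and fix the bug, then return the "
    "corrected function only.\n\nSummary:\n{summary}\n\nCode:\n{code}"
)


def summary_mediated_repair(code: str, generate: Callable[[str], str]) -> str:
    """Two-stage repair: summarise first, then repair conditioned on the summary.

    `generate` stands in for any LLM completion call (an assumption here).
    """
    # Stage 1: produce an error-aware diagnostic summary of the buggy code.
    summary = generate(SUMMARY_PROMPT.format(code=code))
    # Stage 2: repair the code with the summary as an explicit intermediate cue.
    return generate(REPAIR_PROMPT.format(summary=summary, code=code))
```

Because the pipeline is prompt-only, it plugs into any model behind a text-in/text-out interface; swapping the stage-1 template changes the summary style being fused.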

Technology Category

Application Category

πŸ“ Abstract
Large Language Models (LLMs) often produce code with subtle implementation-level bugs despite strong benchmark performance. These errors are hard for LLMs to spot and can have large behavioural effects; yet when asked to summarise code, LLMs frequently surface high-level intent while overlooking this low-level noise. Motivated by this, we propose summary-mediated repair, a prompt-only program-repair pipeline that uses natural-language code summarisation as an explicit intermediate step, extending prior work showing that code summarisation is a useful intermediary for downstream tasks. We evaluate our method across eight production-grade LLMs on two function-level benchmarks (HumanEvalPack and MBPP), comparing several summary styles against a direct-repair baseline. Error-aware diagnostic summaries consistently yield the largest gains, repairing up to 65% of previously undetected errors and averaging 5% more than the baseline, though overall improvements are modest and LLM-dependent. Our results position summaries as a cheap, human-interpretable diagnostic artefact that can be integrated into program-repair pipelines, rather than a stand-alone fix-all.
Problem

Research questions and friction points this paper is trying to address.

LLMs generate code with subtle implementation-level bugs despite strong benchmark performance
Existing methods struggle to detect low-level errors that significantly impact behavior
Program repair needs better diagnostic tools beyond direct code generation approaches
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using code summarization as intermediate repair step
Error-aware diagnostic summaries yield largest repair gains
Integrating summaries as cheap human-interpretable diagnostic artifacts