Calibration of Large Language Models on Code Summarization

📅 2024-04-30
📈 Citations: 11
Influential: 0
🤖 AI Summary
To address the challenge of assessing LLM-generated code summaries in the absence of human-written reference summaries, this paper pioneers a reference-free calibration framework for summary quality evaluation. We propose a confidence quantification method that directly predicts the similarity between an LLM-generated summary and a hypothetical human-written summary—without requiring gold-standard references—by jointly modeling code and its generated summary. Our approach integrates multi-LLM output analysis, semantic embedding comparison (via an enhanced BERTScore), confidence regression modeling, and preference-aligned training grounded in human judgments. The method demonstrates strong generalization across models, programming languages, and application scenarios. Evaluated on Python, Java, and Go datasets, it achieves Pearson correlation coefficients exceeding 0.82 with human judgments and improves calibration accuracy by 23.6% on average over state-of-the-art baselines.
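The summary above mentions semantic embedding comparison via an enhanced BERTScore. As a rough illustration of the precision/recall structure that family of scores shares, here is a toy lexical stand-in that matches exact tokens instead of contextual embeddings; `token_f1` is a hypothetical helper for illustration only, not the paper's enhanced BERTScore.

```python
def token_f1(candidate: str, reference: str) -> float:
    """Toy lexical stand-in for BERTScore: F1 over exact token matches.

    Real BERTScore matches tokens greedily by contextual-embedding cosine
    similarity; here we match tokens by string equality, respecting counts.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Count reference tokens, then consume them as candidate tokens match.
    ref_counts: dict[str, int] = {}
    for t in ref:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    common = 0
    for t in cand:
        if ref_counts.get(t, 0) > 0:
            common += 1
            ref_counts[t] -= 1
    if common == 0:
        return 0.0
    precision = common / len(cand)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

In the paper's setting there is no reference summary at inference time; a score like this is only computable during training or evaluation, which is precisely why the work regresses a confidence estimate from the code and generated summary alone.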

📝 Abstract
A brief, fluent, and relevant summary can be helpful during program comprehension; however, such a summary does require significant human effort to produce. Often, good summaries are unavailable in software projects, which makes maintenance more difficult. There has been a considerable body of research into automated AI-based methods, using Large Language Models (LLMs), to generate summaries of code; there has also been quite a bit of work on ways to measure the performance of such summarization methods, with special attention paid to how closely these AI-generated summaries resemble a summary a human might have produced. Measures such as BERTScore and BLEU have been suggested and evaluated with human-subject studies. However, LLM-generated summaries can be inaccurate, incomplete, etc.: generally, too dissimilar to one that a good developer might write. Given an LLM-generated code summary, how can a user rationally judge if a summary is sufficiently good and reliable? Given just some input source code and an LLM-generated summary, existing approaches can help judge brevity, fluency, and relevance of the summary; however, it is difficult to gauge whether an LLM-generated summary sufficiently resembles what a human might produce, without a "golden" human-produced summary to compare against. We study this resemblance question as a calibration problem: given just the code and the summary from an LLM, can we compute a confidence measure that provides a reliable indication of whether the summary sufficiently resembles what a human would have produced in this situation? We examine this question using several LLMs, for several languages, and in several different settings. Our investigation suggests approaches to provide reliable predictions of the likelihood that an LLM-generated summary would sufficiently resemble a summary a human might write for the same code.
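The abstract frames resemblance as a calibration problem: a confidence score is well calibrated if, among summaries assigned confidence c, roughly a fraction c truly resemble a human-written one. A standard way to quantify this is Expected Calibration Error (ECE). The sketch below is a minimal illustration of that metric, not the paper's implementation; it assumes each summary carries a 0/1 label for "sufficiently resembles a human summary".

```python
def expected_calibration_error(confidences, labels, n_bins=10):
    """Expected Calibration Error: bin predictions by confidence, then
    average each bin's |mean confidence - empirical accuracy| gap,
    weighted by the fraction of samples falling in the bin."""
    assert len(confidences) == len(labels) and len(confidences) > 0
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # right-closed bins (lo, hi]; exact zeros go into the first bin
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(labels[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

# Example: confidence 0.9 assigned to 10 summaries, 9 of which truly
# resemble a human summary -> essentially perfect calibration, ECE ~ 0.
confs = [0.9] * 10
labels = [1] * 9 + [0]
print(expected_calibration_error(confs, labels))
```

A lower ECE means the reported confidence can be taken at face value, which is what lets a user "rationally judge" whether to trust a given summary without a reference to compare against.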
Problem

Research questions and friction points this paper is trying to address.

Calibrating LLMs to assess code summary resemblance to human-written ones
Evaluating reliability of LLM-generated code summaries without human references
Developing confidence measures for human-like code summarization by LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Calibration of LLMs for code summarization
Confidence measure for human-like summaries
Evaluation across multiple languages and settings
Authors
Yuvraj Virk (UC Davis, California, USA)
Prem Devanbu (UC Davis, California, USA)
Toufique Ahmed (IBM Research)
Software Engineering · Machine Learning · ML4SE · Naturalness of Software