Fine-grained Approaches for Confidence Calibration of LLMs in Automated Code Revision

πŸ“… 2026-04-08
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the lack of fine-grained confidence calibration in large language models (LLMs) for automated code revision, which hinders developers' ability to assess the reliability of model outputs. The study introduces fine-grained confidence calibration into this domain for the first time, applying local Platt scaling separately to three distinct types of local edit confidence scores and combining them with global calibration. This hybrid approach overcomes the limitation of conventional global calibration methods, which fail to capture uncertainty at the level of individual edit decisions. Extensive experiments across three code revision tasks and fourteen LLMs demonstrate that the proposed method significantly reduces calibration error, improving calibration quality over a broader range of predicted probabilities and thereby enhancing the trustworthiness of model-generated code revisions.
πŸ“ Abstract
In today's AI-assisted software engineering landscape, developers increasingly depend on LLMs that are highly capable, yet inherently imperfect. The tendency of these models to produce incorrect outputs can reduce developer productivity. A canonical mitigation is to provide calibrated confidence scores that faithfully reflect the likelihood of correctness at the instance level. Such information allows users to make immediate decisions regarding output acceptance, abstain from error-prone outputs, and better align their expectations with the model's capabilities. Since post-trained LLMs do not inherently produce well-calibrated confidence scores, researchers have developed post-hoc calibration methods, with global Platt-scaling of sequence-level confidence scores proving effective in many generative software engineering tasks but remaining unreliable or unexplored for automated code revision (ACR) tasks such as program repair, vulnerability repair, and code refinement. We hypothesise that the coarse-grained nature of this conventional method makes it ill-suited for ACR tasks, where correctness is often determined by local edit decisions and miscalibration can be sample-dependent, thereby motivating fine-grained confidence calibration. To address this, our study proposes local Platt-scaling applied separately to three different fine-grained confidence scores. Through experiments across 3 separate tasks and correctness metrics, as well as 14 models of various sizes, we find that fine-grained confidence scores consistently achieve lower calibration error across a broader range of probability intervals, and this effect is further amplified when global Platt-scaling is applied. Our proposed approaches offer a practical solution for eliciting well-calibrated confidence scores, enabling more trustworthy and streamlined use of imperfect models in ACR tasks.
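The paper's details are not reproduced here, but the core mechanism it builds on, Platt scaling, is simple to sketch: fit a logistic map p = sigmoid(a·s + b) from raw confidence scores s to calibrated probabilities, using held-out (score, correctness) pairs. The snippet below is a minimal illustrative sketch of that idea (the function names, the gradient-descent fitting, and the hyperparameters are this sketch's own choices, not the paper's implementation; "fine-grained" calibration would fit such a map per edit-level score rather than per sequence):

```python
import numpy as np

def platt_scale(scores, labels, lr=0.1, steps=2000):
    """Fit Platt-scaling parameters (a, b) on held-out pairs of
    raw confidence scores and binary correctness labels, by
    gradient descent on the logistic log-loss."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * s + b)))  # current calibrated probs
        grad = p - y                            # d(log-loss)/d(logit)
        a -= lr * np.mean(grad * s)
        b -= lr * np.mean(grad)
    return a, b

def calibrate(scores, a, b):
    """Map raw scores to calibrated probabilities with fitted (a, b)."""
    s = np.asarray(scores, dtype=float)
    return 1.0 / (1.0 + np.exp(-(a * s + b)))

# Toy usage: scores that correlate with correctness on a held-out set.
a, b = platt_scale([-2, -1, 0, 1, 2], [0, 0, 0, 1, 1])
probs = calibrate([-2, 0, 2], a, b)
```

A "global" variant fits one (a, b) over sequence-level scores; the paper's local variant instead fits the map on edit-level confidence scores, which is what lets calibration track individual editing decisions.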
Problem

Research questions and friction points this paper is trying to address.

confidence calibration
large language models
automated code revision
code repair
model reliability
Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained calibration
confidence calibration
automated code revision
local Platt-scaling
large language models