🤖 AI Summary
This work addresses the deployment reliability challenges of current code language models, which often suffer from overconfidence or underconfidence due to the absence of effective uncertainty estimation and active abstention mechanisms. The authors propose a unified, deployment-oriented framework that treats uncertainty as an actionable signal, jointly optimizing model calibration, selective prediction, and lightweight program analysis tool invocation to establish an end-to-end decision-making and repair pipeline. Evaluated on both classification and generation tasks, the approach enables risk-controlled, coverage-adjustable applications, significantly improving correctness ranking and selective prediction performance while maintaining high coverage. This leads to a substantial reduction in error rates and enhances the practical reliability of code language models in real-world scenarios.
📝 Abstract
Code language models are increasingly adopted for both understanding and generative tasks. Despite their success, these models frequently produce overconfident incorrect predictions and underconfident correct predictions, undermining their reliability in deployment. Practical deployment demands three capabilities: accurately estimating the likelihood of correctness, abstaining on uncertain predictions, and invoking external mechanisms to validate or repair abstained outputs. Existing calibration and uncertainty estimation methods, primarily developed for natural language tasks, do not readily transfer to code. Notably, post-hoc calibration techniques often reduce probability misalignment but fail to improve the ranking of predictions by correctness likelihood, which is a requirement for selective prediction under partial coverage. Furthermore, most approaches treat uncertainty as a passive indicator rather than an actionable signal. This work introduces a unified framework that integrates uncertainty estimation, model calibration, and tool-based abstention handling for code models. The proposed design enables models to assign reliable correctness probabilities, abstain under uncertainty, and invoke lightweight program analysis procedures to process abstained cases. By combining these components within a single deployment-oriented workflow, this framework supports risk-aware, coverage-controlled use of code models across both classification and generation settings.
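To illustrate why a post-hoc calibrator can fix probability values without changing which predictions are kept under partial coverage, here is a minimal Python sketch. It is not the authors' method: temperature scaling is used only as a familiar example of a monotonic post-hoc calibrator, and the logits, correctness labels, temperature, and the helper names `temperature_scale` and `selective_risk` are all invented for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def temperature_scale(logits, temperature):
    """Post-hoc calibration sketch: divide each logit by T > 0, then squash."""
    return [sigmoid(z / temperature) for z in logits]


def selective_risk(confidences, correct, coverage):
    """Keep the most confident `coverage` fraction and abstain on the rest.

    Returns (error rate on the kept predictions, indices of kept predictions).
    """
    n_keep = max(1, round(coverage * len(confidences)))
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    kept = order[:n_keep]
    errors = sum(1 for i in kept if not correct[i])
    return errors / n_keep, kept


# Hypothetical model outputs (assumed data, not from the paper).
logits = [4.0, 3.0, 2.5, 1.0, 0.5, -0.5]
correct = [True, True, False, True, False, False]

raw = temperature_scale(logits, 1.0)      # uncalibrated confidences
scaled = temperature_scale(logits, 2.0)   # softened confidences

risk_raw, kept_raw = selective_risk(raw, correct, coverage=0.5)
risk_scaled, kept_scaled = selective_risk(scaled, correct, coverage=0.5)

# Scaling changes the probability values but not their order, so the same
# predictions are kept and the selective risk is identical.
print(kept_raw == kept_scaled, risk_raw, risk_scaled)

# Indices outside `kept_raw` are the abstained cases; in a pipeline like the
# one the abstract describes, these would be handed to an external check.
abstained = [i for i in range(len(logits)) if i not in kept_raw]
```

Because dividing by a positive temperature is monotonic, the confidence ordering is unchanged, so the risk at any fixed coverage stays the same. Improving selective prediction therefore needs a signal that reorders predictions, which is the gap the abstract points to.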